Title: Bayesian Analysis of Autocorrelated Ordered Categorical Data for Industrial Quality Monitoring
ISSN: 0040-1706
Presently available methods to analyze the link between explanatory variables and an ordered categorical response implicitly assume independence. This hypothesis is no longer valid when data are collected over time. We assume temporal dependence and introduce autocorrelation using a latent-variable formulation. Due to intractable distributions, we resort to Gibbs sampling for statistical inference within the Bayesian paradigm. Variable selection is also addressed and appears as a straightforward byproduct of this framework. We illustrate the method by analyzing on-line quality data that possess such autocorrelation.
KEY WORDS: Bayesian analysis; Dependent data; Generalized linear model; Gibbs sampling; Quality control.
The aim of this article is to design and calibrate a relationship between explanatory variables x and an ordered categorical response y when data are collected on a production line. This is of particular interest to a quality engineer when analyzing what parameters may influence a quality criterion, especially in the food industry, where low-cost sensors are now available. Myers and Montgomery (1997) emphasized the use of the generalized linear model (GLM) to treat categorical data in an industrial context. Furthermore, Chipman and Hamada (1996) recommended a particular GLM to analyze univariate ordered categorical data derived from industrial trials. This model was studied previously by Albert and Chib (1993) and will be named hereafter the ordered probit model (OP). However, in all these articles, data are mutually independent either by experimental design or from assumptions made for mathematical convenience. Because planning an experimental design in an operational environment can be complicated (and also expensive), we based our development on the use of historical data recorded on line. On-line quality-monitoring data are generally temporally dependent, so we must account for this dependence to correctly assess the relationship between x and y.
To the best of our knowledge, no such model has been studied by conventional statistical inference until now, except by Blais, MacGibbon, and Roy (1997), who modeled autocorrelated count data. In the Bayesian paradigm, McCulloch and Rossi (1994) raised the problem of autocorrelation between unordered categorical data in the so-called multiperiod probit model but did not analyze it completely. The development presented here explicitly introduces autocorrelation between univariate ordered categorical responses and presents a thorough simulation-based inference from the Bayesian perspective. Furthermore, variable selection, which appears as a natural by-product of this analysis, is also addressed. To simplify the notation, we use the density notation [.|.], popularized by Gelfand and Smith (1990). Thus, joint, conditional, and marginal densities of two random quantities A and B appear, respectively, as [A, B], [A|B], and [B]. For instance, [A, B] = [A|B] x [B]. In the case of discrete events, such notation will refer to the probability distribution.
In this article, the ordered categorical response is collapsed into a simple indicator variable that indicates the observed ordered category (Fahrmeir and Tutz 1994): If the response consists of J categories, then we write y[sub t] = j with j = 1,..., J; the subscript t denotes time. The OP can be defined by
(1) [Y[sub t] = j] = pi[sub tj]
with pi[sub tj] = PHI(gamma[sub j] - x[sub t]beta) - PHI(gamma[sub j-1] - x[sub t]beta), j = 1,..., J,
with pi[sub tj] being the probability of observing category j at time t; x[sub t] is the design vector containing q explanatory variables; beta is a (q x 1) vector of parameters related to x[sub t]; gamma = (gamma[sub 0],..., gamma[sub J]) are unknown threshold boundaries with gamma[sub 0] < gamma[sub 1] < ... < gamma[sub J-1] < gamma[sub J]; PHI is the standard normal cumulative distribution function (cdf). For identification purposes, no constant is included in the design matrix x[sub t], and gamma[sub 0] = -Infinity, gamma[sub J] = +Infinity. An OP for an ordered categorical response with J clusters is thus defined by the parameter vector theta = (gamma, beta) of dimension (q + J - 1) and by using the normal cdf as a link function. A fruitful approach is to consider y[sub t] as the realization of a continuous random variable Z[sub t], categorized by an unknown mechanism. This latent-variable formulation was first introduced by McCullagh (1980) and Anderson and Philips (1981) in a frequentist context. In the Bayesian context, Albert and Chib (1993) pointed out that this elegant formulation was suited to model several kinds of categorical variables and was an efficient way to perform Bayesian estimation through conditional structures of pdf's. Their basic idea was to allow the latent variables Z[sub t] to vary between unknown boundaries gamma[sub 0] < gamma[sub 1] < ... < gamma[sub J-1] < gamma[sub J] and to obtain Y[sub t] as the resulting category in which Z[sub t] is situated:
(2) Y[sub t] = j if gamma[sub j-1] < Z[sub t] < gamma[sub j], j = 1,...,J.
This means that the observation y[sub t] belongs to the category j when the latent variable Z[sub t] falls into the jth interval [gamma[sub j-1], gamma[sub j]] (see Fig. 1). The whole development holds if J is greater than or equal to 2.
The OP is obtained when the latent variables Z[sub t] are mutually independent and normally distributed with mean x[sub t]beta and variance 1, denoted as
(3) Z[sub t] independent, Z[sub t] ~ N(x[sub t]beta, 1).
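The categorization mechanism (2)-(3) is easy to simulate. The Python sketch below (with hypothetical values for beta and the thresholds, not taken from the article) draws latent normals and bins them into ordered categories:

```python
import numpy as np

def simulate_op(X, beta, gamma, rng):
    """Draw ordered categorical responses from the OP via the latent-variable
    mechanism (2): Z_t ~ N(x_t beta, 1), y_t = j iff gamma_{j-1} < Z_t < gamma_j.
    `gamma` holds the finite thresholds gamma_1 < ... < gamma_{J-1};
    gamma_0 = -inf and gamma_J = +inf are implicit."""
    Z = X @ beta + rng.standard_normal(X.shape[0])  # latent N(x_t beta, 1)
    # searchsorted returns 0..J-1; add 1 to index categories 1..J
    return np.searchsorted(gamma, Z) + 1, Z

# hypothetical example: q = 2 predictors, J = 3 categories
rng = np.random.default_rng(0)
T = 200
X = rng.standard_normal((T, 2))
beta = np.array([1.0, -0.5])
gamma = np.array([-0.5, 0.5])
y, Z = simulate_op(X, beta, gamma, rng)
```

The returned y is exactly the collapsed indicator of Section 1: category 1 collects all latent draws below the first threshold, category J all draws above the last one.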
In Section 1, a slight modification of the OP assumptions is proposed to handle autocorrelated data. In Section 2, Bayesian analysis of such a modified OP is handled by considering each group of parameters conditioned on other parameters. This conditional probabilistic reasoning permits the use of the Gibbs sampler for inference. It takes full advantage of conditioning on the latent variable, as did Albert and Chib (1993) for the OP model. In Section 3, within the Bayesian context, a procedure to select influential explanatory variables using Bayes factors is proposed. Bayes factors are straightforward to compute using simulated outputs generated in Section 2. In Section 4, an application to real data collected on a production line in a food industry is given, while Section 5 provides conclusions and discussion.
Based on latent-variable distributional properties, Equation (3) can be rewritten as a linear model of Z[sub t] onto x[sub t]:
(4) Z[sub t] = x[sub t]beta + epsilon[sub t], epsilon[sub t] ~ N(0, 1), t = 1,..., T.
We will refer to this expression as the latent linear model (LLM) and errors epsilon[sub t] are hereafter called latent errors. Accounting for autocorrelation between successive responses is easily obtained by relaxing (4) to the assumption that latent errors follow a stationary first-order autoregressive process [AR(l)]. Bayesian analysis of such a linear model with autocorrelated errors was given by Chib (1993). We thus obtain the OP with autocorrelated latent errors (OP[sup a], hereafter) defined by (1), (2), and (4) with
(5) epsilon[sub t] = rho epsilon[sub t-1] + u[sub t], t = 1,..., T,
where rho is the autocorrelation parameter and the error u[sub t] is a Gaussian white noise with mean 0 and variance sigma[sup 2] = 1; that is, u[sub t] ~ N(0, 1). One can estimate (gamma, beta, rho) using the multinomial likelihood from (1) with pi[sub tj] shifted to pi[sub tj] = PHI(sqrt(1 - rho[sup 2]) x (gamma[sub j] - x[sub t]beta)) - PHI(sqrt(1 - rho[sup 2]) x (gamma[sub j-1] - x[sub t]beta)). However, the maximum likelihood (ML) estimator can be quite inaccurate, especially with a small dataset.
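As a sketch of the OP[sup a] data-generating process (Python, hypothetical parameter values), one can simulate the AR(1) latent errors (5), bin the resulting latent variables as in (2), and check that the empirical lag-1 autocorrelation of the errors is close to rho:

```python
import numpy as np

def simulate_op_a(X, beta, gamma, rho, rng):
    """OP with AR(1) latent errors: eps_t = rho*eps_{t-1} + u_t, u_t ~ N(0,1).
    eps_0 is drawn from the stationary law N(0, 1/(1 - rho^2))."""
    T = X.shape[0]
    eps = np.empty(T)
    eps[0] = rng.normal(0.0, 1.0 / np.sqrt(1.0 - rho**2))
    for t in range(1, T):
        eps[t] = rho * eps[t - 1] + rng.standard_normal()
    Z = X @ beta + eps                      # latent linear model (4)
    return np.searchsorted(gamma, Z) + 1, eps

rng = np.random.default_rng(1)
T = 5000                                    # long series, for a stable estimate
X = rng.standard_normal((T, 1))
y, eps = simulate_op_a(X, np.array([0.8]), np.array([-0.5, 0.5]), 0.7, rng)
r1 = np.corrcoef(eps[:-1], eps[1:])[0, 1]   # empirical lag-1 autocorrelation
```

With rho = 0.7, r1 should land close to 0.7, which is the dependence the plain OP cannot capture.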
In this section, the latent variable is used to perform the Bayesian analysis of OP[sup a] through conditional Monte Carlo generation. Iterative generations are similar to the sampling strategies highlighted by Gelfand and Smith (1990). Let Z[sup rho, sub t] = Z[sub t] - rho Z[sub t-1] (t = 1,..., T) be the uncorrelated latent variables and x[sup rho, sub t] = x[sub t] - rho x[sub t-1] (t = 1,..., T) the uncorrelated explanatory variables. In the former expression, when t = 1, Z[sub 0] appears as an unknown quantity: It will be called hereafter the initial latent variable. From this definition of Z[sup rho, sub t] and (5), it is clear that the Z[sup rho, sub t] (t = 1,..., T) are independent and normally distributed:
(6) Z[sup rho, sub t] independent, Z[sup rho, sub t] ~ N(x[sup rho, sub t]beta, 1).
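A quick numerical check of this quasi-differencing step (Python, hypothetical rho): applying epsilon[sub t] - rho epsilon[sub t-1] to simulated AR(1) errors recovers an approximately uncorrelated, unit-variance white noise, which is what makes (6) hold:

```python
import numpy as np

rng = np.random.default_rng(2)
rho, T = 0.7, 20000
# AR(1) latent errors as in (5), started from the stationary law
eps = np.empty(T)
eps[0] = rng.normal(0.0, 1.0 / np.sqrt(1.0 - rho**2))
for t in range(1, T):
    eps[t] = rho * eps[t - 1] + rng.standard_normal()
# quasi-differencing: eps_t - rho*eps_{t-1} recovers the white noise u_t,
# the same transform that turns Z_t into the uncorrelated Z^rho_t
u = eps[1:] - rho * eps[:-1]
lag1 = np.corrcoef(u[:-1], u[1:])[0, 1]     # should be near 0
```

The recovered u series has lag-1 autocorrelation near 0 and variance near 1, matching u[sub t] ~ N(0, 1).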
2.1 Joint Parameter Posterior Distribution
In the Bayesian paradigm, the latent-variable vector and its initial level (Z[sub 0], Z) = (Z[sub 0], Z[sub 1],..., Z[sub T]) must be considered as an additional parameter of the model. The vector of interest remains theta = (gamma, beta, rho), but the total parameter set of the OP[sup a] is extended to (Z[sub 0], Z, theta). Bayesian analysis then consists in deriving the joint posterior probability density function (pdf) of (Z[sub 0], Z, theta). The posterior pdf of theta can be obtained by integrating out the nuisance parameters (Z[sub 0], Z).
Using the Bayes theorem, the joint pdf [Z[sub 0], Z, gamma, beta, rho|y] is proportional to the product of the joint pdf of (y, Z) and the prior pdf of (Z[sub 0], gamma, beta, rho):
(7) [Z[sub 0], Z, gamma, beta, rho|y] is proportional to [y|Z[sub 0], Z, gamma, beta, rho] x [Z|Z[sub 0], gamma, beta, rho] x [Z[sub 0], gamma, beta, rho] is proportional to [y, Z|Z[sub 0], gamma, beta, rho] x [Z[sub 0], gamma, beta, rho].
From (2) and (6), we have
(8) [y, Z|Z[sub 0], gamma, beta, rho] = PRODUCT[sup T, sub t=1] {1[sub gamma[sub y[sub t]-1] < Z[sub t] < gamma[sub y[sub t]]] x N(Z[sup rho, sub t]|x[sup rho, sub t]beta, 1)},
where 1[sub A] equals 1 when A is true or 0 otherwise. The joint posterior pdf is thus
(9) [Z[sub 0], Z, gamma, beta, rho|y] is proportional to {PRODUCT[sup T, sub t=1] 1[sub gamma[sub y[sub t]-1] < Z[sub t] < gamma[sub y[sub t]]] x N(Z[sup rho, sub t]|x[sup rho, sub t]beta, 1)} x [Z[sub 0]] x [gamma] x [beta] x [rho],
which seems at first quite intractable. However, using conjugate assumptions about the prior pdf's, all full-conditional pdf's can be expressed in an explicit form (see Appendix A). Derivation of full-conditional pdf's is quite similar to what has been done already for the OP (Albert and Chib 1993) and for the linear model with autocorrelated errors (Chib 1993; Girard and Parent 2000a).
2.2 Prior Specification
Since the likelihood function belongs to the exponential family, we assume that all prior pdf's belong to its conjugate family--namely, the normal family itself:
(10) [Z[sub 0]] = N(Z[sub 0]|a[sub 0], 1); [gamma] = N[sub J-1](gamma|gamma[sub 0], D) x 1[sub gamma[sub 1] < ... < gamma[sub J-1]]; [beta] = N[sub q](beta|beta[sub 0], SIGMA[sup -1, sub 0]); [rho] = N(rho|rho[sub 0], V[sub 0]) x 1[sub |rho| < 1],
where rho is limited to (-1, +1) to ensure stationarity.
Using conjugate properties (Berger 1980), all the conditional pdf's thus have forms that facilitate sampling. As noted by Chipman and Hamada (1996), the normal priors allow considerable flexibility and easy computation.
First, the initial latent variable Z[sub 0] is assumed to be normally distributed to be compatible with (3). Then, as long as beta does not contain an intercept term, a prior mean 0 for beta seems plausible. Typically, SIGMA[sub 0] is diagonal, with SIGMA[sub 0] = sigma[sup 2, sub beta] I[sub q] for equally scaled predictors, because no relevant information about the influence of the variables is known. It is also reasonable to take all thresholds independent of each other, so that D is diagonal: D = diag(sigma[sup 2, sub gamma[sub j]]) (j = 1,..., J - 1). Moreover, the priors for gamma are also justifiable because they equally allow for a uniform case (taking sigma[sup 2, sub gamma[sub j]] = 1, j = 1,..., J - 1) or an uninformative case (taking sigma[sup 2, sub gamma[sub j]] arrow right +Infinity, j = 1,..., J - 1). Finally, the prior for the autocorrelation term allows flexibility: it is possible to consider either a uniform pdf on (-1, +1), with V[sub 0] arrow right +Infinity, or a very precise prior placing a large mass on a highly credible value (see Sec. 4.2.3).
2.3 Expression of Full-Conditional Distributions
Since dependence of Z on y is expressed only through gamma and prior independence is assumed between the parameters' degrees of belief (10), the conditional pdf's can be readily expressed up to a normalizing constant. The full-conditional pdf's are summarized in Table 1. Theoretical developments and detailed calculations can be found in Appendix A. The Gibbs sampler can therefore be run to obtain a sample from the parameters' joint posterior pdf.
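To convey the flavor of the sampler, here is a minimal Albert-and-Chib-style Gibbs sketch in Python for the plain OP, with the thresholds held fixed for simplicity (the full conditionals of Table 1 also update gamma, rho, and Z[sub 0]); the function names and the simulated data are hypothetical, not the paper's:

```python
import numpy as np
from statistics import NormalDist

nd = NormalDist()

def trunc_norm(m, lo, hi, rng):
    # one draw from N(m, 1) truncated to (lo, hi), by inverse-CDF sampling
    a, b = nd.cdf(lo - m), nd.cdf(hi - m)
    u = min(max(a + (b - a) * rng.uniform(), 1e-12), 1 - 1e-12)
    return m + nd.inv_cdf(u)

def gibbs_op(X, y, gamma, n_iter, sigma_b2=1.0, seed=0):
    """Gibbs sketch for the OP with thresholds gamma held fixed:
    alternate [Z | beta, y] (truncated normals) and [beta | Z] (normal),
    with prior beta ~ N(0, sigma_b2 * I)."""
    rng = np.random.default_rng(seed)
    T, q = X.shape
    bounds = np.concatenate(([-np.inf], gamma, [np.inf]))
    beta = np.zeros(q)
    V = np.linalg.inv(X.T @ X + np.eye(q) / sigma_b2)  # posterior cov of beta
    L = np.linalg.cholesky(V)
    draws = []
    for _ in range(n_iter):
        mu = X @ beta
        # [Z_t | beta, y_t]: N(x_t beta, 1) truncated to the y_t-th interval
        Z = np.array([trunc_norm(mu[t], bounds[y[t] - 1], bounds[y[t]], rng)
                      for t in range(T)])
        # [beta | Z]: normal linear-model update
        beta = V @ (X.T @ Z) + L @ rng.standard_normal(q)
        draws.append(beta.copy())
    return np.array(draws)

# demo on simulated data (hypothetical true values, not the SCM dataset)
rng = np.random.default_rng(42)
T = 300
X = rng.standard_normal((T, 2))
gamma = np.array([-0.5, 0.5])
Z_true = X @ np.array([1.0, -1.0]) + rng.standard_normal(T)
y = np.searchsorted(gamma, Z_true) + 1
draws = gibbs_op(X, y, gamma, n_iter=400)
post_mean = draws[100:].mean(axis=0)        # discard burn-in, then average
```

Conditioning on the latent Z reduces every sweep to standard normal and truncated-normal draws, which is exactly what makes the conjugate structure of Table 1 tractable.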
We now consider the question of comparing competing OP models. Competing models arise from different latent-error structures (with or without autocorrelated latent errors) and/or restrictions on the explanatory variables (omitting or including some variables). In this article, two different latent-error structures have been considered through OP and OP[sup a]; the model under study will be denoted as M and will appear in the conditioning.
3.1 Bayes Factors as a Basis for Model Comparison
Chipman and Hamada (1996) applied conventional significance-test methodology and computed only posterior p values for any parameters. Conversely, in this article, we rely entirely on the Bayesian paradigm, and comparisons are made on the basis of Bayes factors (Kass and Raftery 1995; Bernardo and Smith 1994). Bayes factors, which are a very general tool for model selection, do not require that alternative models be nested. In many cases, calculations can be technically simpler than classical significance tests. Finally, the well-known Bayes information criterion can be obtained as a rough asymptotic approximation of the Bayes factors on a log scale.
Bayes factors are calculated as ratios of marginal likelihood [y|M]. To handle Bayes factors for inference, one can use the table proposed by Kass and Raftery (1995), reproduced as Table 2. The log of Bayes factors is on the same scale as the deviance and likelihood ratio test statistics.
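A small Python helper (with threshold labels following the Kass and Raftery scale reproduced as Table 2) turns two log marginal likelihoods into 2 log B and an evidence label:

```python
def bayes_factor_evidence(log_ml_1, log_ml_2):
    """Evidence for model M1 over M2 on the Kass & Raftery (1995) scale,
    from the log marginal likelihoods log [y|M1] and log [y|M2]."""
    two_log_b = 2.0 * (log_ml_1 - log_ml_2)   # same scale as the deviance
    if two_log_b < 2:
        label = "not worth more than a bare mention"
    elif two_log_b < 6:
        label = "positive"
    elif two_log_b < 10:
        label = "strong"
    else:
        label = "very strong"
    return two_log_b, label

# hypothetical log marginal likelihoods for two competing models
score, label = bayes_factor_evidence(-100.0, -104.0)   # 2 log B = 8
```

The two input values above are illustrative only; in the article they come from the estimation procedure of Section 3.2.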
3.2 Estimation of Marginal Likelihood
For the models studied in this article, unfortunately no analytical expression of the marginal likelihood is available. In such a case, the marginal likelihood [y|M] is sometimes estimated by the Laplace method for GLM (Raftery 1996) or by evaluating the posterior harmonic mean of the likelihood (Kass and Raftery 1995). In the present case, these methods are not attractive: The former works well for smooth likelihoods only, and the latter involves the repeated evaluation of the likelihood function [y|theta, M] with draws of theta that may sooner or later occur far from high density regions, thus introducing strong perturbation into the estimated quantities. We concentrate on an alternative method developed by Chib (1995), which is based on Gibbs sampling and Rao-Blackwell density estimates.
On the computationally convenient logarithmic scale, inverting the Bayes formula leads to
(11) log[y|M] = log[y|theta, M] + log[theta|M] - log[theta|y, M],
which is estimated at a high posterior density point denoted as theta[sup *]. On the right side of the expression, the first two terms are evaluated at theta[sup *], and the last term log[theta|y, M] is obtained by conditional/marginal decomposition. Here, only the estimation of the marginal likelihood of the OP[sup a] is computed (for the OP, interested readers can refer to Appendix B). In this case, theta is composed of three subsets of parameters, theta = (gamma, beta, rho), and, using conditional decomposition, the last term can be expressed as
(12) log[theta|y, M] = log[beta|y, M] + log[rho|beta, y, M] + log[gamma|beta, rho, y, M].
The first term of (12) is estimated using an accurate Rao-Blackwell density estimator based on Gibbs sampling, evaluated at beta = beta[sup *] (see Appendix B for details). Then, the second (third) term is estimated using the same type of estimator with a second (third) sample obtained by continuing sampling with beta fixed at beta[sup *] (with beta = beta[sup *] and rho = rho[sup *]). In practice, for estimating the last term, the normalizing constant of the truncated normal pdf [gamma[sub j]|Z, Z[sub 0], beta, rho, gamma[sub j-1], y] is evaluated by a polynomial approximation available in the statistical toolbox of Matlab(R), working with a precision of 10[sup -8] [see also Chib and Greenberg (1998) for another technique based on kernel estimation].
The marginal likelihoods are estimated in this way, and the Bayes factors are computed for each comparison.
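The identity (11) can be checked numerically on a conjugate toy model (a Python sketch with a normal likelihood of known variance and a normal prior; all values are hypothetical). Because the posterior is available in closed form, the right side of (11) gives the same answer at any evaluation point theta[sup *] and matches the closed-form marginal:

```python
import numpy as np
from math import log, pi

rng = np.random.default_rng(3)
n, tau2 = 10, 4.0
y = rng.normal(1.0, 1.0, n)                 # y_i ~ N(theta, 1), theta ~ N(0, tau2)

def log_norm_pdf(x, m, v):
    return -0.5 * (log(2 * pi * v) + (x - m) ** 2 / v)

# conjugate pieces: posterior is N(m_n, v_n)
v_n = 1.0 / (n + 1.0 / tau2)
m_n = v_n * y.sum()

def chib_log_marginal(theta_star):
    # identity (11): log m(y) = log f(y|theta*) + log pi(theta*) - log pi(theta*|y)
    log_lik = sum(log_norm_pdf(yi, theta_star, 1.0) for yi in y)
    return (log_lik + log_norm_pdf(theta_star, 0.0, tau2)
            - log_norm_pdf(theta_star, m_n, v_n))

lm_at_mode = chib_log_marginal(m_n)         # evaluated at a high-density point
lm_elsewhere = chib_log_marginal(0.0)       # the identity holds anywhere

# cross-check against the closed-form marginal: y ~ N(0, I + tau2 * 11')
Sigma = np.eye(n) + tau2 * np.ones((n, n))
sign, logdet = np.linalg.slogdet(Sigma)
lm_direct = -0.5 * (n * log(2 * pi) + logdet + y @ np.linalg.solve(Sigma, y))
```

In the toy case the posterior ordinate is exact, so (11) is exact; in the OP[sup a] the posterior ordinate must itself be estimated, which is where the Rao-Blackwell estimators of Section 3.2 enter.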
4.1 Case Study
The data arise from the quality monitoring scheme of Nestle Sweetened Condensed Milk (SCM) and were collected on a production line at the Boue factory (France) during 1997. Because most variables are not measured on line, each data record corresponds to an entire batch representing approximately 20 tons of product. The database therefore consists of a collection of 454 recordings corresponding to 454 production batches (Girard 1999).
The SCM process consists roughly of evaporating water from sweetened milk and is diagrammed in Figure 2. The quality monitoring scheme for this process was designed to monitor several quality indicators of SCM, one of which is denoted as y. This response consists of three ordered categories that empirically indicate the tendency of SCM's age-thickening--low, medium, and high. Five explanatory variables are available on line (both explanatory and response variables are measured by several inspectors). Two types of explanatory variables can be distinguished:
1. The first category is related to milk characteristics (milk dried extract x[sub 1] and milk fat percentage x[sub 2]) and therefore quality of raw material. Because the database covers a whole year, changes in milk properties must be taken into account in this analysis.
2. The second category is process variables that could influence age-thickening (Girard 1997): The pasteurization temperature denoted as x[sub 3] is the first thermal treatment applied to the milk, the temperature denoted as x[sub 4] is the final thermal treatment applied to the SCM, and the storage time denoted as x[sub 5] measures in hours the mechanical treatment applied to the SCM after production.
Figure 3 shows a subset of the database used in this analysis (100 of the 454 observations, for legibility). The first five subplots show the explanatory variables, while the last subplot shows the response variable. The explanatory variables have been standardized for confidentiality reasons. The objective of the quality engineer (and subsequently our own) is to select those variables that most strongly influence age-thickening of SCM.
The SCM's producer wants the response y to be in the first category of viscosity--that is, low viscosity and hence low tendency toward age-thickening. Until now, monitoring of SCM's age-thickening had been based only on empirical knowledge; it had been presumed to be affected only by pasteurization temperature. Thus, foremen modified the pasteurization temperature daily according to their experience and the previously observed y.
4.2 Modeling Age-thickening by an Ordered Probit Model
As a first approach, the OP model can be used to link the age-thickening indicator y to the five explanatory variables (as described by Chipman and Hamada 1996). It is assumed that these five variables are accurate and that their influence can be expressed by a simple regression model in the LLM without interaction.
Choosing linear main effects only may appear as an oversimplification, ignoring interactions and nonlinear effects. However, a model is always a simplification aiming at a delicate balance between realism and parsimony. The probit link itself [or other links used in GLMs; see Fahrmeir and Tutz (1994)] has a nonlinear effect on the linear combination of explanatory variables (see Fig. 1): The probability of the tth element belonging to the jth category does not vary linearly with the explanatory variables. When interactions between explanatory variables were added, no evidence of improved fit of the model to the data was found, likely due to the rough discrete nature of the data, which makes further refinement of the hidden linear-model structure of covariation difficult to attain. As shown in what follows, introducing time correlation between residuals is a much more promising approach.
4.2.1 Case Study Prior Specification. A small amount of literature is devoted to the elicitation of prior information from operating experts' knowledge. Nevertheless, in a problem also related to viscosity, Girard and Parent (2000a) used subjective information from producers' experience obtained through a questionnaire: Prior hyperparameters were thereby estimated from a normal approximation to the distributions of responses. As we first consider OP as a reference model, we need first to specify the prior pdf's of beta and gamma given by (10). In the present application, we let prior means be 0 so that gamma[sub 0] = beta[sub 0] = 0. Specifying priors is therefore controlled by fixing only the prior variances sigma[sup 2, sub gamma[sub j]] and sigma[sup 2, sub beta]. Following the results of Chipman and Hamada (1996), who analyzed the influence of these parameters on inference, we let sigma[sup 2, sub gamma[sub j]] = 1 and sigma[sup 2, sub beta] = 1.
4.2.2 The Gibbs Sampling. The Gibbs sampling algorithm is written using the Matlab(R) statistics toolbox, and all the following results come from runs on a 200-MHz PC with 32 Mb RAM. Convergence of the simulated chain was assessed following the recommendation of Gelman, Carlin, Stern, and Rubin (1995): Convergence is monitored by generating several parallel sequences of arbitrary length and then computing a statistic, denoted R[sup 1/2, sub psi] (see Gelman et al. 1995, p. 331). This statistic is viewed as a ratio of total variability to the "within" variability of a simulated chain of psi, an estimand of interest. Convergence is attained as R[sup 1/2, sub psi] tends to 1. In practice, Gelman proposed (Kass, Carlin, Gelman, and Neal 1998) that R[sup 1/2, sub psi] should be less than 1.2.
Due to ergodic properties, only one single chain is run to derive the joint posterior pdf. The first M draws are discarded, and the N last ones form the desired pseudosample. The values M = 200 and N = 5,000 were determined through previous experiments--M using graphical monitoring of convergence stabilization (not reported here) and N using the statistic R[sup 1/2, sub psi] with psi = beta or gamma[sub j], j = 1,..., J - 1 (see Table 3). These numbers closely match those used by Albert and Chib (1993), McCulloch and Rossi (1994), Chipman and Hamada (1996), and Chib and Greenberg (1998).
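A minimal Python version of the R[sup 1/2, sub psi] diagnostic (the simple between/within form, without the chain splitting used in later refinements; the chains below are simulated, hypothetical draws):

```python
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction R^(1/2) from m parallel chains of length n:
    compares between-chain and within-chain variability (Gelman et al. 1995)."""
    chains = np.asarray(chains)             # shape (m, n)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)         # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()   # within-chain variance
    var_hat = (n - 1) / n * W + B / n       # pooled variance estimate
    return np.sqrt(var_hat / W)             # -> 1 as the chains mix

rng = np.random.default_rng(4)
mixed = rng.standard_normal((4, 2000))               # 4 well-mixed chains
stuck = mixed + np.array([[0.0], [1.0], [2.0], [3.0]])  # chains at different levels
```

For the well-mixed chains the statistic is essentially 1; for the offset chains it clearly exceeds the 1.2 rule of thumb quoted above.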
4.2.3 Latent Residuals Posterior Diagnosis. Checking model assumptions can be a hard task in a conventional perspective. As noted by Albert and Chib (1995), one can easily check a model by using the so-called Bayesian residuals calculated from Equation (4) [see also Gelman et al. (1995), and Carlin and Louis (1996) for general use of Bayesian residuals analysis in model building].
The independence of latent errors can be checked by estimating the autocorrelogram (Box and Jenkins 1976). Moreover, in a simulation context, one can simply obtain a sample of the first-order autocorrelation of the latent error epsilon[sub t], denoted {r[sup (g)], g = 1,..., N}, with r[sup (g)] = SIGMA[sup T, sub t=2] epsilon[sup (g), sub t-1] epsilon[sup (g), sub t] / SIGMA[sup T, sub t=2] epsilon[sup (g)2, sub t-1], where epsilon[sup (g), sub t] is computed from (4) and a Gibbs sample {theta[sup (g)], g = 1,..., N}. The autocorrelogram, the first-order autocorrelation, and the normality check of latent errors on the posterior mean of the latent errors are shown, respectively, in Figure 4, (a), (b), and (c). Figure 4, (a) and (b), points in favor of autocorrelation. Following Box-Jenkins methodology, we should try to model the errors by an AR(1) or, equivalently, redesign the model as an OP[sup a]. The suitability of such a modeling option will be confirmed later on the basis of Bayes factors.
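The Monte Carlo diagnostic above can be sketched in Python: for each residual draw epsilon[sup (g)] (here simulated AR(1) series standing in for Gibbs residual draws, with hypothetical rho and sample sizes), compute r[sup (g)] and inspect the resulting sample:

```python
import numpy as np

def lag1_autocorr(eps):
    """First-order autocorrelation r of one latent-error draw, as in Sec. 4.2.3:
    r = sum_{t=2}^T eps_{t-1} eps_t / sum_{t=2}^T eps_{t-1}^2."""
    return (eps[:-1] * eps[1:]).sum() / (eps[:-1] ** 2).sum()

rng = np.random.default_rng(5)
rho, T, G = 0.7, 454, 200       # T matches the 454 batches; rho, G hypothetical
r_draws = []
for g in range(G):
    # stand-in for the g-th Gibbs residual draw eps^(g)
    eps = np.empty(T)
    eps[0] = rng.normal(0.0, 1.0 / np.sqrt(1.0 - rho**2))
    for t in range(1, T):
        eps[t] = rho * eps[t - 1] + rng.standard_normal()
    r_draws.append(lag1_autocorr(eps))
```

The histogram of {r[sup (g)]} concentrates near the true rho, which is the kind of evidence Figure 4(b) displays for the SCM data.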
Prior specification must then be added for the additional parameter rho. In a parallel study on the same milk process, Girard and Parent (2000a) used data from the same production-line database to model a continuous response and exhibited autocorrelated errors in the linear model. The autocorrelation term had a mean of approximately .7 and a standard deviation close to .05. These values are then used as prior specification for the OP[sup a] parameters, and the OP[sup a] is analyzed as indicated in Section 2.
4.2.4 Comparing Models With the Bayes Factors. Models that include or omit some variables will be indexed hereafter by the indicator vector xi, which reflects which explanatory variables are included in the model. With this notation, under the model OP[sub xi], the distribution of y depends on the explanatory variables X[sub xi] through the definition (1) and beta[sub xi]; X[sub xi] corresponds to the columns of X for which the coordinates of xi are 1.
Considering the five variables and two latent-error structures (OP and OP[sup a]) leads to 2 x (2[sup 5] - 1) = 62 competing models; there are therefore 62 x 61/2 = 1,891 possible comparisons. Because of the large number of comparisons, we report in Table 4 only those models whose log Bayes factor against the highest prior predictive pdf model is contained in the range [-10, 10]. Thus, models neglected in Table 4 are, with respect to Bayes factors, 10 orders of magnitude less believable than the best choice. This methodology was called Occam's window by Kass and Raftery (1995).
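The model count is pure bookkeeping and can be verified with a few lines of Python (no data involved):

```python
from itertools import combinations, product

q = 5
# every nonempty inclusion/exclusion pattern xi over the five variables...
subsets = [xi for xi in product([0, 1], repeat=q) if any(xi)]
# ...under each latent-error structure (OP and OP^a)
models = [(struct, xi) for struct in ("OP", "OPa") for xi in subsets]
n_pairs = len(list(combinations(models, 2)))   # pairwise comparisons
```

This yields 31 variable subsets, 62 models, and 1,891 pairwise comparisons, hence the need for an Occam's-window-style pruning.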
As shown in Table 4, Bayes factors in favor of model OP[sup a, sub (0,0,1,0,1)] indicate positive to strong evidence against all other models: Therefore, we will adopt an OP[sup a] with only pasteurization temperature (x[sub 3]) and storage time (x[sub 5]) as significant influential variables. Preliminary trials on the prior variances sigma[sub beta] and sigma[sub gamma] varying in a small range around their initial values show no significant change in our results. A more systematic and complete comparative study should be performed since Bayes factors are known to be highly sensitive to priors.
The composite Bayes factor (Bernardo and Smith 1994) helps to test the hypothesis of existence of autocorrelation between categorical responses. Indeed, the Bayes factor relative to H[sub 0]: "existence of first-order autocorrelation" against its converse H[sub 1]: "no autocorrelation or autocorrelation of higher order," denoted B[sub 01], is computed as follows:
B[sub 01] = SIGMA[sub xi] [y|OP[sup a, sub xi]] x [OP[sup a, sub xi]] / SIGMA[sub xi] [y|OP[sub xi]] x [OP[sub xi]],
where M denotes the finite set of models under study--that is, {OP[sub xi]; OP[sup a, sub xi]; xi is an element of {0; 1}[sup q]}. With the SCM data, the Bayes factor for existence of autocorrelation, B[sub 01], is 4.3 (see Table 4), indicating positive evidence in favor of autocorrelation. We thus conclude that autocorrelation should be introduced into the model for time dependence between data collected on line.
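A Python sketch of the composite Bayes factor under equal prior model weights (the model priors then cancel; the log marginal likelihood values below are hypothetical, not the SCM results):

```python
import math

def log_sum_exp(xs):
    # numerically stable log(sum(exp(x))) over a list of log values
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def composite_bf(log_ml_h0, log_ml_h1):
    """Composite Bayes factor B_01 for H0 vs H1 with equal prior model weights:
    the ratio of summed marginal likelihoods over each hypothesis' model set."""
    return math.exp(log_sum_exp(log_ml_h0) - log_sum_exp(log_ml_h1))

# hypothetical log marginal likelihoods for the retained OP^a and OP models
b01 = composite_bf([-210.0, -212.5, -214.0], [-211.5, -213.0, -216.0])
```

Working in log space avoids underflow, since marginal likelihoods of 454 observations are astronomically small numbers.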
4.2.5 Posterior Marginal Distributions. The relationship of the response y to all variables X is affected when accounting for autocorrelation, as illustrated in Figure 5. For instance, beta[sub 1], related to the first explanatory variable (Fig. 5a), was quite close to 0 under the initial model OP. Under the modified model OP[sup a], the value 0 becomes less credible. The opposite situation holds for beta[sub 4] (Fig. 5d).
We can also observe that all marginal densities flatten out when autocorrelation between latent errors is introduced. Intuitively, one can think that the coefficients beta related to explanatory variables in the OP[sup a] model are freer to explore potential zones of variation than those in the OP model.
If one wants to ascertain the specific influence of an explanatory variable, Figure 5 should not be overinterpreted; conclusions should be made on the basis of Bayes factors. From the previous paragraph, OP[sup a, sub (0,0,1,0,1)] was concluded to be the model that best supports the data from SCM quality monitoring and is the one chosen for subsequent interpretation. The marginal pdf of predictors related to the two selected explanatory variables and the marginal pdf of the autocorrelation term of the selected model are sketched in Figure 6. Suppose that most of the mass for the pdf of beta[sub i] is concentrated along the negative abscissa. When x[sub t] increases, the normal distribution centered at x[sub t]beta shifts to the left side, and therefore the probability pi[sub tj] decreases for high j and increases for low j (see also Fig. 1). In this application, increasing the two variables lowers the tendency of age-thickening. The influence of the second variable, the storage time, is relatively easy to understand: The mechanical action exerted on the SCM during storage breaks down the milk protein links. The instant viscosity therefore decreases, as does the tendency of age-thickening (see Girard 1997). Conversely, the intrinsic interpretation of the action of the pasteurization temperature is not straightforward and will need further research on milk properties.
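This shift of category probabilities is easy to verify numerically; the Python sketch below evaluates the pi[sub tj] of (1) at two values of x for a hypothetical negative coefficient and hypothetical thresholds:

```python
from statistics import NormalDist

Phi = NormalDist().cdf

def category_probs(xb, gamma):
    # pi_j = PHI(gamma_j - x*beta) - PHI(gamma_{j-1} - x*beta), as in (1)
    bounds = [float("-inf")] + list(gamma) + [float("inf")]
    return [Phi(bounds[j] - xb) - Phi(bounds[j - 1] - xb)
            for j in range(1, len(bounds))]

beta, gamma = -0.8, [-0.5, 0.5]        # hypothetical negative coefficient, J = 3
low_x = category_probs(1.0 * beta, gamma)
high_x = category_probs(2.0 * beta, gamma)
```

With beta < 0, raising x lowers x*beta, so the probability of the lowest (best) category grows and that of the highest category shrinks, matching the qualitative reading of Figure 6.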
This study was a first step toward better statistical control with on-line data. Finally, based on the OP model, the quality engineer would recommend that the producer increase both pasteurization temperature and storage time in order to lower the age-thickening tendency. Had the control variables been set out of their ordinary working range for the SCM process, it would be necessary to plan a further experimental design to explore more precisely the influence of explanatory factors.
When a quality engineer wants to study the links between ordered categorical data and explanatory variables, the OP model appears to be appropriate. However, if data are collected on line, the temporal dependence should be taken into account and subsequent alternative models should be compared. The Bayesian approach outlined in this article is an attempt to develop such a line of reasoning. The model presented here can also be viewed as an extension of the OP model to treat dependent data.
The following conclusions have been attained:
1. The OP[sup a] model is particularly suitable to model the influence of explanatory variables on an ordered categorical response with data collected on line with a temporal structure. Furthermore, even if the model structure is quite complex, quality engineers can easily interpret the model through the regressor coefficients beta in the LLM (4). Finally, beta pdf's provide meaningful information to complete the analysis. Conclusions can be different from empirical knowledge as illustrated previously.
2. Most of the current methods for analyzing ordered categorical data encounter problems. For instance, ML estimators are known to be quite inaccurate, especially with small datasets. Furthermore, as compared to conventional statistics, Bayesian inference does not rely on asymptotic properties and can therefore be performed even with minimal data. The Bayesian analysis procedure proposed in this article follows the recent developments of applied Bayesian statistics, allowing, by simulation, a sample to be obtained from the exact posterior pdf. The accuracy of estimates is then limited by the computing resources. Of course, it is also limited (and maybe more so) by the extent and the accuracy of the prior beliefs of the researcher. This method is also accessible to the average statistical practitioner. Another advantage of this simulation-based inference approach is the availability of numerous computing schemes for posterior diagnosis, such as the computation of the first-order autocorrelation pdf (conventional statistics, conversely, cannot express this pdf).
3. Bayes factors, which are straightforwardly estimated via Gibbs sampling, provide a simple way to compare models even when they are not nested. They require proper priors and are known to be highly sensitive to prior specifications. In an industrial context, however, subjective information from producers' experience is often available to elicit rather informative priors, and the sensitivity of the results can be checked in the neighborhood of the prior specification. Bayes factors allow evidence in favor of each alternative hypothesis to be quantified. Moreover, the Bayesian paradigm allows for comparison of hypotheses that belong to a more general problem rather than only for a selection of variables, as illustrated by the autocorrelation feature.
4. Introduction of the latent variable (2) is the central technical point of the development. Albert and Chib (1993) outlined in their seminal article that this point allows conditional reasoning and recourse to Gibbs sampling. This framework can be used to check model assumptions. Furthermore, latent errors are the starting point of model extension. For instance, it can be seen from Chib and Greenberg (1998) that the binary and ordered probit model approach of Albert and Chib (1993) can be extended with no theoretical difficulties to multivariate binary responses.
The latent variable has been introduced primarily for the technical reason of outlining conditioning structures. Nevertheless, in some applications, the latent variable can bear a meaningful interpretation. For instance, (a) Anderson and Philipps (1981) suggested that a regression model would hold if the response could be measured more precisely and (b) when a consumer is faced with various economic options, Geweke, Keane, and Runkle (1997) interpreted the latent variable as a utility value derived by the consumer. In a more complex dependence case, extension of OP[sup a] to AR(p) latent errors is easy to address and follows the same line of reasoning. One need only introduce p - 1 additional initial levels Z[sub -1], Z[sub -2],..., Z[sub -p+1] and p - 1 additional AR coefficients phi. The AR(1) assumption (5) is then relaxed to an AR(p) latent error, epsilon[sub t] = phi(B)epsilon[sub t] + u[sub t] (t = 1,..., T), where phi(B) is a polynomial in the backward operator B given by phi(B) = SIGMA[sup p, sub i=1] phi[sub i]B[sup i]. The resulting (q + J + p - 1) vector parameter is theta = (gamma, beta, phi), and the total parameter (theta, Z[sub 0], Z[sub -1], Z[sub -2],..., Z[sub -p+1], Z) is estimated using Gibbs sampling. The only difficulties come from [Z[sub t]|Z[sub is not equal to t], theta] = [Z[sub t]|Z[sub 1],..., Z[sub t-1], Z[sub t+1],..., Z[sub T], theta] (for 1 less than or equal to t less than or equal to T), which cannot be expressed in a simple way.
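As a quick illustration of the AR(p) latent-error structure just described, the following sketch (NumPy; simulate_ar_errors is a hypothetical helper and the parameter values are arbitrary) simulates epsilon[sub t] = SIGMA[sub i] phi[sub i]epsilon[sub t-i] + u[sub t] starting from zero initial levels:

```python
import numpy as np

def simulate_ar_errors(phi, T, seed=0):
    """Simulate T steps of an AR(p) latent error
    epsilon_t = sum_i phi_i * epsilon_{t-i} + u_t, with u_t ~ N(0, 1),
    starting from zero initial levels epsilon_0, ..., epsilon_{-p+1}."""
    rng = np.random.default_rng(seed)
    p = len(phi)
    eps = np.zeros(T + p)                      # first p entries: initial levels
    for t in range(p, T + p):
        # eps[t-p:t][::-1] lists epsilon_{t-1}, ..., epsilon_{t-p}
        eps[t] = np.dot(phi, eps[t - p:t][::-1]) + rng.normal()
    return eps[p:]

# the AR(1) error of (5) is the special case p = 1, phi = (rho,)
errors = simulate_ar_errors(phi=[0.6], T=200)
```

The same seed reproduces the same path, which is convenient when checking a Gibbs implementation against simulated data.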
Moreover, the conditional reasoning used in Section 2 is quite general and attractive. Thus in the OP, when working conditionally on the latent variable and the boundaries gamma, an LLM (4) is exhibited; one can thus imagine transferring every known result from linear model theory onto the LLM. For instance, one can consider an unknown variance sigma[sup 2] in (3) and use the Gibbs sampler to obtain a sample from the joint posterior pdf of (Z, gamma, beta, sigma[sup 2]). This model is quite attractive from a robustness perspective and allows a wide range of link functions to be considered. Moreover, extension to a hierarchical LLM is also easy to address with this conditional reasoning [this so-called conditional hierarchical independent model was presented by Albert and Chib (1997)].
Finally, the simulated sample of joint posterior pdf's is also valuable for deriving posterior predictive pdf's. These pdf's are needed to compute Bayesian risks in a decision-making context. An example of putting this idea into practice in predictive control under constraints was given by Girard and Parent (2000b).
We gratefully acknowledge Jacques Bernier for discussions on the Rao-Blackwell theorem and Christian Robert for pointing out to us that recourse to computer packages is not a disgrace. Thanks to Ruth Acheson, Marcel Baumgartner, and Lucien Duckstein for their proofreading. We are grateful to two anonymous referees and an associate editor for many helpful suggestions.
Interpretation of Bayes factors:

B[sub j, k]    2 log(B[sub j, k])    Evidence in favor of M[sub j]
0 to 3         0 to 2                Not worth more than a bare mention
3 to 20        2 to 6                Positive
20 to 150      6 to 10               Strong
> 150          > 10                  Very strong
r[sup 1/2] convergence statistic (Gelman et al. 1995) for each parameter, by number of iterations N:

N        gamma[sub 1]   gamma[sub 2]   beta[sub 1]   beta[sub 2]   beta[sub 3]   beta[sub 4]   beta[sub 5]
1,000    1.591          1.452          1.041         1.039         1.029         1.033         1.011
2,000    1.382          1.333          1.027         1.033         1.033         1.020         1.020
3,000    1.346          1.295          1.017         1.030         1.018         1.013         1.019
4,000    1.204          1.183          1.015         1.023         1.021         1.012         1.016
5,000    1.132          1.121          1.010         1.012         1.020         1.006         1.011
Values of B[sub row, column] for the candidate models (columns index the same models as the rows, in order):

                                  (1)     (2)     (3)     (4)     (5)     (6)     (7)     (8)      (9)     (10)
(1)  OP[sub (0,1,1,1,1)]           --
(2)  OP[sub (0,1,1,0,1)]          0.21     --
(3)  OP[sub (0,0,1,0,1)]          3.18    3.39     --
(4)  OP[sub (0,0,1,1,1)]          1.79    2.00   -1.39     --
(5)  OP[sub (0,0,0,0,1)]         -1.52   -1.31   -4.70   -3.31     --
(6)  OP[sup a, sub (0,1,1,0,1)]   0.87    1.08   -2.31    0.92    2.39     --
(7)  OP[sup a, sub (0,1,0,0,1)]  -1.52   -1.31   -4.70   -3.31    0.00   -2.39     --
(8)  OP[sup a, sub (0,0,1,0,1)]   6.94    7.15    3.76    5.15    8.46    6.07    8.46     --
(9)  OP[sup a, sub (0,0,1,1,1)]  -1.54   -1.33   -4.71   -3.33    0.01   -2.41    0.02   -8.47     --
(10) OP[sup a, sub (0,0,0,0,1)]   3.77    3.98    0.59    1.98    5.29    2.90    5.29   -3.17    5.31     --
(11) OP[sup a, sub (0,0,0,1,1)]  -3.87   -3.66   -7.05   -5.66   -2.35   -4.74   -2.35  -10.81   -2.34   -7.64
GRAPH: Figure 1. Latent Variable to Ordered Categorical Data Correspondence
DIAGRAM: Figure 2. Process Scheme for Sweetened Condensed Milk.
GRAPHS: Figure 3. Subset of the Database Used in the Case Study.
GRAPHS: Figure 4. Posterior Diagnostics on the Latent Error of OP[sub (1,1,1,1,1)]: (a) Autocorrelogram; (b) First-Order Autocorrelation; (c) Normality Checks.
GRAPHS: Figure 5. Posterior Marginal Distributions of predictors beta for OP[sub (1,1,1,1,1)] (dotted line) versus OP[sup a, sub (1,1,1,1,1)] (solid line): (a) beta[sub 1]; (b) beta[sub 2]; (c) beta[sub 3]; (d) beta[sub 4]; (e) beta[sub 5].
GRAPHS: Figure 6. Marginal Posterior Distribution of Predictor beta and Autocorrelation Term for the Model Selected on the Basis of Bayes Factor: (a) beta[sub 1], (b) beta[sub 2], (c) rho, (d) Covariation of beta[sub 2] Versus rho.
Albert, J. H., and Chib, S. (1993), "Bayesian Analysis of Binary and Polytomous Response Data," Journal of the American Statistical Association, 88, 669-679.
----- (1995), "Bayesian Residual Analysis for Binary Response Regression Models," Biometrika, 82, 747-759.
----- (1997), "Bayesian Tests and Model Diagnostics in Conditionally Hierarchical Models," Journal of the American Statistical Association, 92, 916-925.
Anderson, J. A., and Philipps, P. R. (1981), "Regression, Discrimination and Measurement Models for Ordered Categorical Variables," Applied Statistics, 30, 22-31.
Berger, J. O. (1980), Statistical Decision Theory and Bayesian Analysis, New York: Springer-Verlag.
Bernardo, J. M., and Smith, A. F. M. (1994), Bayesian Theory, London: Wiley.
Blais, M., McGibon, B., and Roy, R. (1997), "Inference in Generalized Linear Models for Time Series of Counts," Technical Report G-97-61, University of Montreal.
Box, G. E. P., and Jenkins, G. M. (1976), Time Series Analysis: Forecasting and Control (2nd ed.), San Francisco: Holden-Day.
Carlin, B. P., and Louis, T. A. (1996), Bayes and Empirical Bayes Methods for Data Analysis, London: Chapman and Hall.
Chib, S. (1993), "Bayes Regression With Autoregressive Errors--a Gibbs Sampling Approach," Journal of Econometrics, 58, 275-294.
----- (1995), "Marginal Likelihood From the Gibbs Output," Journal of the American Statistical Association, 90, 1313-1321.
Chib, S., and Greenberg, E. (1998), "Analysis of Multivariate Probit Models," Biometrika, 85, 347-361.
Chipman, H., and Hamada, M. (1996), "Bayesian Analysis of Ordered Categorical Data From Industrial Experiments," Technometrics, 38, 1-10.
Fahrmeir, L., and Tutz, G. (1994), Multivariate Statistical Modelling Based on Generalized Linear Models, New York: Springer-Verlag.
Gelfand, A., and Smith, A. F. M. (1990), "Sampling-Based Approaches to Calculating Marginal Densities," Journal of the American Statistical Association, 85, 398-409.
Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (1995), Bayesian Data Analysis, London: Chapman and Hall.
Geweke, J. F., Keane, M. P., and Runkle, D. E. (1997), "Statistical Inference in the Multinomial Multiperiod Probit Model," Journal of Econometrics, 80, 125-165.
Girard, P. (1997), "Influence de la Composition Physico-chimique et du Procede de Fabrication sur la Viscosite du Lait Concentre Sucre. Synthese Bibliographique," report, Nestle Research Center, Lausanne, Switzerland.
----- (1999), "Optimisation du Suivi Operationnel de la Qualite en Usine par la Modelisation Statistique de Procedes a Partir de Donnees Recueillies sur Ligne," unpublished Ph.D. thesis, Ecole du Genie Rural des Eaux et des Forets, Paris (France).
Girard, P., and Parent, E. (2000a), "Analyse Bayesienne du Modele Lineaire a Erreur Autocorrelee: Application a la Modelisation d'un Procede Agroalimentaire a Partir de Donnees Recueillies sur Ligne," Revue de Statistique Appliquees, 41, 1-15.
----- (2000b), "The Deductive Phase of Statistical Inference via Predictive Simulations: Test, Validation and Control of a Linear Model With Autocorrelated Errors on a Food Industry Case Study," unpublished manuscript submitted to Journal of Statistical Planning and Inference.
Kass, R. E., Carlin, B. P., Gelman, A., and Neal, R. M. (1998), "Markov Chain Monte Carlo in Practice: A Roundtable Discussion," available at http://www.amstat.org/publications/tas/kass.pdf
Kass, R. E., and Raftery, A. E. (1995), "Bayes Factors," Journal of the American Statistical Association, 90, 773-795.
Lindley, D. V., and Smith, A. F. M. (1972), "Bayes Estimates for the Linear Model" (with discussion), Journal of the Royal Statistical Society, Ser. B, 34, 1-41.
McCullagh, P. (1980), "Regression Models for Ordinal Data," Journal of the Royal Statistical Society, Ser. B, 42, 109-142.
McCulloch, R., and Rossi, P. E. (1994), "An Exact Likelihood Analysis of the Multinomial Probit Model," Journal of Econometrics, 64, 207-240.
Myers, R. H., and Montgomery, D. C. (1997), "A Tutorial on Generalized Linear Models," Journal of Quality Technology, 29, 274-291.
Raftery, A. E. (1996), "Approximate Bayes Factors and Accounting for Model Uncertainty in Generalised Linear Models," Biometrika, 83, 251-266.
Robert, C. P. (1995), "Simulation of Truncated Normal Variables," Statistics and Computing, 5, 121-125.
For the OP[sup a], we first recall that our interest is focused on theta = (gamma, beta, rho) [see Eqs. (1), (4), and (5) for the model definition]. The latent variables (Z[sub 0], Z) are considered as nuisance parameters to be included in the parameter set for Bayesian inference. Since we use only partially conjugate priors, the posterior pdf (9) is not a standard one. However, provided that all the full conditional pdf's are available in some explicit form convenient for simulation, the posterior pdf can be sampled through the Gibbs algorithm (Gelfand and Smith 1990).
In the following developments, we first derive all full conditional pdf's and second give practical advice to run the Gibbs sampler.
A.1 Full Joint Posterior pdf and Derivation of all Full Conditional pdf's of the OP[sup a] Model
The posterior pdf reads, up to a normalizing constant, as [see Eq. (9)]
[Z[sub 0], Z, gamma, beta, rho|y] is proportional to [Z[sub 0], gamma, beta, rho] x PI[sup T, sub t=1] N(Z[sub t]|rho Z[sub t-1] + (x[sub t] - rho x[sub t-1])beta, 1)1[sub gamma[sub y[sub t]-1] < Z[sub t] <= gamma[sub y[sub t]]],
where the prior pdf [Z[sub 0], gamma, beta, rho] is decomposed assuming a priori independence as [Z[sub 0], gamma, beta, rho] = [Z[sub 0]] x [gamma] x [beta] x [rho]. More precisely [see Eq. (10)],
[Z[sub 0]] = N(Z[sub 0]|a[sub 0], 1); [gamma] = N[sub J-1](gamma|gamma[sub 0], D)1[sub gamma[sub 1] < ... < gamma[sub J-1]]; [beta] = N[sub q](beta|beta[sub 0], SIGMA[sup -1, sub 0]); [rho] = N(rho|rho[sub 0], V[sub 0])1[sub |rho| < 1].
To obtain all the full conditional pdf's, we take full advantage of conditional factorization: for two quantities A and B and observations y, one can write [A, B|y] = [A|B, y] x [B|y]. Viewed as a function of A alone, the previous expression can be written as
(A.1) [A|B, y] is proportional to [A, B|y].
As a consequence, the conditional pdf [A|B, y] can be obtained from the posterior pdf of (A, B) by extracting from [A, B|y] only the terms including A. This strategy is used hereafter to identify all the full conditional pdf's from [Eq. (9)].
1. Replacing A by Z[sub 0] and B by (Z, gamma, beta, rho) in Equation (A.1), Equation (9) provides
[Z[sub 0]|Z, gamma, beta, rho, y] is proportional to N(Z[sub 0]|a[sub 0], 1) x N(Z[sub 1]|rho Z[sub 0] + (x[sub 1] - rho x[sub 0])beta, 1).
Expanding the terms in the exponentials of the normal pdf's on the right side of the previous expression, we have (Z[sub 0] - a[sub 0])[sup 2] + (Z[sub 1] - rho Z[sub 0] - (x[sub 1] - rho x[sub 0])beta)[sup 2], where the first term in this sum comes from the first normal pdf and the second from the second. Completing the square in Z[sub 0] leads to the following pdf:
(A.2) [Z[sub 0]|Z, gamma, beta, rho, y] = N(Z[sub 0]|(a[sub 0] + rho(Z[sub 1] - (x[sub 1] - rho x[sub 0])beta))/(1 + rho[sup 2]), 1/(1 + rho[sup 2])).
2. Replacing A by Z and B by (Z[sub 0], gamma, beta, rho) in Equation (A.1), Equation (9) gives
[Z|Z[sub 0], gamma, beta, rho, y] is proportional to PI[sup T, sub t=1] N(Z[sub t]|rho Z[sub t-1] + (x[sub t] - rho x[sub t-1])beta, 1)1[sub gamma[sub y[sub t]-1] < Z[sub t] <= gamma[sub y[sub t]]].
Expanding the square terms in the exponentials of the normal pdf's, one can distinguish two cases: (1) when t = 1,..., T - 1, Z[sub t] intervenes only in the two exponential terms involving Z[sub t-1] and Z[sub t+1], namely (Z[sub t] - rho Z[sub t-1] - (x[sub t] - rho x[sub t-1])beta)[sup 2] + (Z[sub t+1] - rho Z[sub t] - (x[sub t+1] - rho x[sub t])beta)[sup 2], and exact calculation following the same line of reasoning as for Z[sub 0] leads to the following conditional pdf:
(A.3a) [Z[sub t]|Z[sub is not equal to t], Z[sub 0], gamma, beta, rho, y] = N(Z[sub t]|Zbar[sub t], 1/(1 + rho[sup 2]))1[sub gamma[sub y[sub t]-1] < Z[sub t] <= gamma[sub y[sub t]]],
which is a truncated normal, where Zbar[sub t] = (rho Z[sub t-1] + rho Z[sub t+1] + (x[sub t] - rho x[sub t-1])beta - rho(x[sub t+1] - rho x[sub t])beta)/(1 + rho[sup 2]) and Z[sub is not equal to t] denotes the vector (Z[sub 1],..., Z[sub t-1], Z[sub t+1],..., Z[sub T]); (2) when t = T, Z[sub T] intervenes in the exponential terms only with Z[sub T-1], and we have
(A.3b) [Z[sub T]|Z[sub is not equal to T], Z[sub 0], gamma, beta, rho, y] = N(Z[sub T]|rho Z[sub T-1] + (x[sub T] - rho x[sub T-1])beta, 1)1[sub gamma[sub y[sub T]-1] < Z[sub T] <= gamma[sub y[sub T]]].
3. Replacing A by gamma and B by (Z, Z[sub 0], beta, rho) in Equation (A.1), Equation (9) leads to
[gamma|Z, Z[sub 0], beta, rho, y] is proportional to N[sub J-1](gamma|gamma[sub 0], D)1[sub gamma[sub 1] < ... < gamma[sub J-1]] x PI[sup T, sub t=1] 1[sub gamma[sub y[sub t]-1] < Z[sub t] <= gamma[sub y[sub t]]].
Since gamma is restricted by 1[sub gamma] so that gamma[sub 1] < gamma[sub 2] < ... < gamma[sub J-1] and D = diag(sigma[sup 2, sub gamma[sub j]]), one can consider each gamma[sub j], j = 1,..., J - 1, term by term. We thus have
[gamma[sub j]|Z, Z[sub 0], gamma[sub is not equal to j], beta, rho, y] is proportional to N(gamma[sub j]|gamma[sub j0], sigma[sup 2, sub gamma[sub j]]) x PI[sub {t: y[sub t] = j}] 1[sub Z[sub t] <= gamma[sub j]] x PI[sub {t: y[sub t] = j+1}] 1[sub gamma[sub j] < Z[sub t]] x 1[sub gamma[sub j-1] < gamma[sub j] < gamma[sub j+1]],
which is equivalent to
(A.4) [gamma[sub j]|Z, Z[sub 0], gamma[sub is not equal to j], beta, rho, y] is proportional to N(gamma[sub j]|gamma[sub j0], sigma[sup 2, sub gamma[sub j]])1[sub gamma[sup inf, sub j] < gamma[sub j] < gamma[sup sup, sub j]],
with gamma[sup inf, sub j] = max{max{Z[sub t]: y[sub t] = j}; gamma[sub j-1]} and gamma[sup sup, sub j] = min{min{Z[sub t]: y[sub t] = j + 1}; gamma[sub j+1]}, and where the normalizing constant can be evaluated as (Integral of[sup gamma[sup sup, sub j], sub gamma[sup inf, sub j]] N(gamma[sub j]|gamma[sub j0], sigma[sup 2, sub gamma[sub j]])d gamma[sub j])[sup -1] using the univariate normal cdf.
4. Replacing A by beta and B by (Z, Z[sub 0], gamma, rho) in Equation (A.1), Equation (9) reads as
[beta|Z, Z[sub 0], gamma, rho, y] is proportional to N[sub q](beta|beta[sub 0], SIGMA[sup -1, sub 0]) x PI[sup T, sub t=1] N(Z[sup rho, sub t]|x[sup rho, sub t]beta, 1).
We use classical results of the linear model (e.g., Lindley and Smith 1972) on the uncorrelated LLM [Eq. (6)] Z[sub p] = X[sub p]beta + u, where Z[sub p] = (Z[sup rho, sub 1], Z[sup rho, sub 2],..., Z[sup rho, sub T])' and X[sub p] = (x[sup rho, sub 1], x[sup rho, sub 2],..., x[sup rho, sub T])', and obtain
(A.5) [beta|Z, Z[sub 0], gamma, rho, y] = N[sub q](beta|m[sub beta], V[sub beta])
with
V[sub beta] = (SIGMA[sub 0] + X[sub p]'X[sub p])[sup -1] and m[sub beta] = V[sub beta](SIGMA[sub 0]beta[sub 0] + X[sub p]'Z[sub p]).
5. Finally, replacing A by rho and B by (Z, Z[sub 0], gamma, beta) in Equation (A.1), Equation (9) leads to
[rho|Z, Z[sub 0], gamma, beta, y] is proportional to N(rho|rho[sub 0], V[sub 0])1[sub |rho| < 1] x PI[sup T, sub t=1] N(epsilon[sub t]|rho epsilon[sub t-1], 1), where epsilon[sub t] = Z[sub t] - x[sub t]beta.
Exhibiting in the exponential terms of the normal pdf all the terms containing rho leads to the following pdf:
(A.6) [rho|Z, Z[sub 0], gamma, beta, y] is proportional to N(rho|m[sub rho], V[sub rho])1[sub |rho| < 1],
with
V[sub rho] = (V[sup -1, sub 0] + SIGMA[sup T, sub t=1] epsilon[sup 2, sub t-1])[sup -1] and m[sub rho] = V[sub rho](V[sup -1, sub 0]rho[sub 0] + SIGMA[sup T, sub t=1] epsilon[sub t-1]epsilon[sub t]).
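To make the full conditionals (A.2) through (A.6) concrete, here is a minimal NumPy sketch of their means, variances, and truncation bounds. All function and variable names are hypothetical; it assumes SIGMA[sub 0] in (A.5) is the prior precision of beta, V[sub 0] in (A.6) is the prior variance of rho, and the rows of X are the covariate vectors x[sub t]:

```python
import numpy as np

def z0_cond(Z1, x1, x0, beta, rho, a0):
    """Mean and variance of [Z_0 | ...], Eq. (A.2)."""
    var = 1.0 / (1.0 + rho ** 2)
    return var * (a0 + rho * (Z1 - (x1 - rho * x0) @ beta)), var

def zt_cond(t, Z, Z0, X, x0, beta, rho):
    """Mean and variance of [Z_t | Z_{!=t}, theta], Eqs. (A.3a)/(A.3b); t is
    1-based.  The draw is then truncated to (gamma_{y_t - 1}, gamma_{y_t}]."""
    T = len(Z)
    x_prev = x0 if t == 1 else X[t - 2]
    Z_prev = Z0 if t == 1 else Z[t - 2]
    c_t = (X[t - 1] - rho * x_prev) @ beta          # (x_t - rho*x_{t-1})'beta
    if t < T:                                       # interior point, (A.3a)
        c_next = (X[t] - rho * X[t - 1]) @ beta
        var = 1.0 / (1.0 + rho ** 2)
        return var * (rho * Z_prev + c_t + rho * (Z[t] - c_next)), var
    return rho * Z_prev + c_t, 1.0                  # last point, (A.3b)

def gamma_bounds(j, Z, y, gamma):
    """Truncation interval (gamma_j^inf, gamma_j^sup) of Eq. (A.4);
    gamma holds gamma_0 = -inf, gamma_1, ..., gamma_{J-1}, gamma_J = +inf."""
    Z, y = np.asarray(Z), np.asarray(y)
    in_j, in_j1 = Z[y == j], Z[y == j + 1]
    g_inf = max(in_j.max() if in_j.size else -np.inf, gamma[j - 1])
    g_sup = min(in_j1.min() if in_j1.size else np.inf, gamma[j + 1])
    return g_inf, g_sup

def beta_cond(Zp, Xp, beta0, Sigma0):
    """Mean and covariance of [beta | ...], Eq. (A.5), for Z_p = X_p beta + u."""
    cov = np.linalg.inv(Sigma0 + Xp.T @ Xp)
    return cov @ (Sigma0 @ beta0 + Xp.T @ Zp), cov

def rho_cond(Z, Z0, X, x0, beta, rho0, V0):
    """Mean and variance of [rho | ...], Eq. (A.6), before truncation to |rho| < 1."""
    eps = np.concatenate(([Z0 - x0 @ beta], Z - X @ beta))
    lagged, cur = eps[:-1], eps[1:]
    V_hat = 1.0 / (1.0 / V0 + np.sum(lagged ** 2))
    return V_hat * (rho0 / V0 + np.sum(lagged * cur)), V_hat
```

Each Gibbs step then draws from a normal with the returned moments, truncated where the corresponding full conditional requires it.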
A.2 Gibbs Sampling Implementation
Samples from the joint pdf [Z, Z[sub 0], beta, gamma, rho|y] are obtained by sampling iteratively from each of the full conditional pdf's (A.2) up to (A.6) in this order, while updating the conditioning parameters. Samples from the marginal pdf's [theta[sub i]|y], i = 1,..., k (where theta[sub i] is equivalent to gamma, beta, or rho), are also available when considering one parameter at a time. We focus the user's attention on the generation of the latent variables Z[sub t] (1 less than or equal to t < T) and gamma[sub j], respectively, from pdf's (A.3a) and (A.4), which are both truncated normals. For both, in our application, we simply generate values from a normal variable and discard all values until the simulated value belongs to the desired interval. Moreover, when generating gamma, the vector of threshold boundaries, one generates gamma[sub 1] up to gamma[sub J-1] in turn, sampling gamma[sup (i), sub j] from [gamma[sub j]|Z, Z[sub 0], gamma[sup (i-1), sub j+1], gamma[sup (i), sub j-1], beta, rho, y] for j = 1,..., J - 1.
Finally, if one wants to optimize the algorithm, one can use uniform approximation as done by Albert and Chib (1993) or recourse can be made to more advanced algorithms as given by Robert (1995).
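The naive accept/reject scheme described above (sample from the untruncated normal and discard values outside the interval) can be sketched as follows; truncnorm_reject is a hypothetical name, and, as just noted, Robert (1995) gives more efficient samplers when the interval has small probability:

```python
import numpy as np

def truncnorm_reject(mean, var, lower, upper, rng, max_tries=100000):
    """Naive rejection sampling from N(mean, var) restricted to
    (lower, upper]: keep drawing until a value lands in the interval."""
    sd = np.sqrt(var)
    for _ in range(max_tries):
        z = rng.normal(mean, sd)
        if lower < z <= upper:
            return z
    raise RuntimeError("interval probability too small for naive rejection")

rng = np.random.default_rng(3)
z = truncnorm_reject(0.0, 1.0, 0.5, np.inf, rng)   # e.g. a draw for (A.3a)
```

The expected number of tries is the reciprocal of the interval's normal probability, which is why the more advanced algorithms pay off for narrow or far-tail intervals.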
The method using Bayes factors requires the computation of the marginal likelihood [y|M] of all models (Kass and Raftery 1995). Since the posterior pdf of OP[sup a] or OP is not a standard one, the corresponding marginal likelihood is not known in explicit form and must be estimated. Among other techniques, Chib's is of interest because it is a byproduct of Gibbs sampling.
Let us remember that, in the OP[sup a] model, the last term of (11) becomes Equation (12), which reads as
log[theta|y, M] = log[beta|y, M] + log[rho|beta, y, M] + log[gamma|beta, rho, y, M] when M is OP[sup a]
or
log[theta|y, M] = log[gamma|beta, y, M] + log[beta|y, M] when M is OP,
where the previous terms are estimated as follows. To simplify notation, reference to the model is dropped.
B.1 The OP[sup a] Case
From the first Gibbs run, the posterior marginal pdf [beta|y] can be estimated with a Rao-Blackwellized density estimator (Gelfand and Smith 1990), since the full conditional pdf's used for the Gibbs sampling (Appendix A) are available in closed form:
(B.1) [beta|y] is estimated by (1/N)SIGMA[sup N, sub i=1] [beta|Z[sup (i)], Z[sup (i), sub 0], gamma[sup (i)], rho[sup (i)], y],
which is evaluated at beta = beta[sup *], beta[sup *] being a high posterior point, such as the posterior mode or mean.
Next consider the estimation of [rho[sup *]|beta[sup *], y] = Integral of [rho[sup *]|Z, Z[sub 0], gamma, beta[sup *], y][Z, Z[sub 0], gamma|beta[sup *], y]dZ dZ[sub 0] d gamma, where [rho|Z, Z[sub 0], gamma, beta, y] is (A.6). A key point is that this integral can be estimated very accurately by drawing a large sample of (Z, Z[sub 0], gamma) values from the density [Z, Z[sub 0], gamma|beta[sup *], y]. As shown by Chib (1995), a sample of (Z, Z[sub 0], gamma) is produced from a reduced Gibbs sampling run consisting of the pdf's [Z|beta[sup *], Z[sub 0], gamma, rho, y], [Z[sub 0]|Z, rho, beta[sup *], gamma, y], [rho|Z, Z[sub 0], gamma, beta[sup *], y], and [gamma|Z, Z[sub 0], beta[sup *], rho, y], which are the same pdf's used in the first run but with beta fixed at beta[sup *]. Given this sample of N values of (Z, Z[sub 0], gamma), an estimate of [rho|beta[sup *], y] is available as
(B.2) [rho|beta[sup *], y] is estimated by (1/N)SIGMA[sup N, sub i=1] [rho|Z[sup (i)], Z[sup (i), sub 0], gamma[sup (i)], beta[sup *], y],
which is evaluated at rho = rho[sup*].
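The Rao-Blackwellized ordinate (B.2) averages the full conditional (A.6), a normal truncated to |rho| < 1, over the reduced-run draws. A minimal pure-Python sketch (hypothetical names; cond_means and cond_vars are assumed to hold the (A.6) moments computed from each draw (Z[sup (i)], Z[sup (i), sub 0], gamma[sup (i)])):

```python
import math

def normal_pdf(x, m, v):
    return math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2.0 * math.pi * v)

def normal_cdf(x, m, v):
    return 0.5 * (1.0 + math.erf((x - m) / math.sqrt(2.0 * v)))

def rao_blackwell_rho_ordinate(rho_star, cond_means, cond_vars):
    """Average, over the N reduced-run draws, the (A.6) full conditional
    -- a normal truncated to (-1, 1) -- evaluated at rho_star."""
    total = 0.0
    for m, v in zip(cond_means, cond_vars):
        trunc_mass = normal_cdf(1.0, m, v) - normal_cdf(-1.0, m, v)
        total += normal_pdf(rho_star, m, v) / trunc_mass
    return total / len(cond_means)
```

Note the division by the truncation mass: omitting it would bias the ordinate, hence the marginal likelihood, whenever the conditional places appreciable mass outside (-1, 1).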
Finally, consider the estimation of [gamma[sup *]|rho[sup *], beta[sup *], y]. Continuing this strategy, a sample of (Z, Z[sub 0]) is produced from a third reduced Gibbs sampling run using [Z|Z[sub 0], gamma, beta[sup *], rho[sup *], y], [Z[sub 0]|Z, gamma, beta[sup *], rho[sup *], y], and [gamma|Z, Z[sub 0], beta[sup *], rho[sup *], y], which are the same pdf's used in the first run but with beta fixed at beta[sup *] and rho at rho[sup *]. In our case study, we consider only a three-class OP[sup a] model, but all this calculation can be extended without any difficulty by following the same line of reasoning. Here, gamma has only two components, since gamma[sub 0] = -Infinity < gamma[sub 1] < gamma[sub 2] < gamma[sub 3] = +Infinity. Moreover, since log[gamma|beta, rho, y] = log[gamma[sub 2]|gamma[sub 1], beta, rho, y] + log[gamma[sub 1]|beta, rho, y], we estimate these two components separately, respectively, by
(B.3) [gamma[sub 2]|gamma[sup *, sub 1], beta[sup *], rho[sup *], y] is estimated by (1/N)SIGMA[sup N, sub i=1] [gamma[sub 2]|Z[sup (i)], Z[sup (i), sub 0], gamma[sup *, sub 1], beta[sup *], rho[sup *], y]
and
(B.4) [gamma[sub 1]|beta[sup *], rho[sup *], y] is estimated by (1/N)SIGMA[sup N, sub i=1] [gamma[sub 1]|Z[sup (i)], Z[sup (i), sub 0], gamma[sup (i), sub 2], beta[sup *], rho[sup *], y],
which are evaluated, respectively, at gamma[sub 2] = gamma[sup *, sub 2] and at gamma[sub 1] = gamma[sup *, sub 1], and where the right sides of the two previous expressions come straightforwardly from (A.4) (one must pay special attention to the normalizing constant, which is evaluated with the normal cdf available in most standard statistical toolboxes).
B.2 The OP Case
For this model, because no autocorrelation is included, (B.2) is not necessary. Furthermore, (B.1) is replaced by
(B.5) [beta|y] is estimated by (1/N)SIGMA[sup N, sub i=1] [beta|Z[sup (i)], gamma[sup (i)], y],
(B.3) by
(B.6) [gamma[sub 2]|gamma[sup *, sub 1], beta[sup *], y] is estimated by (1/N)SIGMA[sup N, sub i=1] [gamma[sub 2]|Z[sup (i)], gamma[sup *, sub 1], beta[sup *], y],
and (B.4) by
(B.7) [gamma[sub 1]|beta[sup *], y] is estimated by (1/N)SIGMA[sup N, sub i=1] [gamma[sub 1]|Z[sup (i)], gamma[sup (i), sub 2], beta[sup *], y].
Estimation of Bayes factors for both models is summed up in Table B.1.
TABLE: Table B.1. Estimation of Bayes Factors for Both Models.
~~~~~~~~
By Philippe Girard, Nestle Research Center, Quality and Safety Assurance Department, Nestec Ltd., 1000 Lausanne 26, Switzerland (philippe.girard@rdls.nestle.com), and Eric Parent, Laboratoire GRESE, French Institute of Forestry, Agricultural and Environmental Engineering, 75732 Paris Cedex 15, France (parent@engref.fr)
Title: | Predicting adverse infant health outcomes using routine screening variables: modelling the impact of interdependent risk factors. |
Source: | |
Author(s): | |
Abstract: | This paper sets out a methodology for risk assessment of pregnancies in terms of adverse outcomes such as low birth-weight and neonatal mortality in a situation of multiple but possibly interdependent major dimensions of risk. In the present analysis, the outcome is very low birth-weight and the observed risk indicators are assumed to be linked to three main dimensions: socio-demographic, bio-medical status, and fertility history. Summary scores for each mother under each risk dimension are derived from observed indicators and used as the basis for a multidimensional classification to high or low risk. A fully Bayesian method of implementation is applied to estimation and prediction. A case study is presented of very low birth-weight singleton livebirths over 1991-93 in a health region covering North West London and parts of the adjacent South East of England, with validating predictions to maternities in 1994. [ABSTRACT FROM AUTHOR] |
AN: | 3954096 |
ISSN: | 0266-4763 |
Database: | Business Source Premier |
1 Introduction
There are a variety of methods to assess the risk of adverse maternity outcomes. This paper considers methods that allow for conceptually different types of risk, which affect maternity outcome, and so may provide a more sensitive basis for monitoring pregnancies than a single overall risk score that conflates different risk concepts. Perinatal outcomes reflect a variety of interacting influences of biomedical, psychosocial and lifestyle factors (Herrera et al., 1997; Schwartz, 1982), and these may not act cumulatively. While a single score has attractions, when the causality of the event is multidimensional and conceptually distinct risk factors are correlated, use of a single score may reduce the sensitivity of out of sample predictions (Philosophov & Ryabinina, 1997).
The present paper focuses only on those factors known early in pregnancy: if risk scores incorporate complications that develop in pregnancy or are based on indicators observable just before delivery, then maternities with complications qualify almost automatically as high-risk, so giving an exaggerated impression of the sensitivity of the screening system (Clarke et al., 1993). In particular, including developing variables tends to diminish the effect of socio-demographic variables on the outcome (Kiely, 1991), especially if the developing variables are intervening between the social variables and the outcome.
A variety of scoring methods have been based on combining relative risks associated with a wide range of observed indicators. These generally result in a single risk index combining the risks implied by indicators of maternal history, health behaviour (e.g. maternal smoking status) or maternal physiology (e.g. height and weight) and may be obtained by logistic regression techniques (Clarke et al., 1993; Ross et al., 1986). These methods attempt to find a range of relatively independent risk factors using the usual variable selection methods. A somewhat different methodology has been developed for screening of malformations such as Down's Syndrome (Wald & Cuckle, 1989). This allows for the interdependence of risk factors in producing a high or low risk of an adverse outcome. A range of risk indicators or biochemical markers are combined, and a likelihood ratio combining all the markers is used to assess the likely risk category.
A feature of screening technology is the use of the Bayes formula to update the a priori probability of high-risk by the likelihood ratio (Larson et al., 1998). Nevertheless, the majority of existing methods use classical estimation methods to estimate the impact of risk factors or the parameters of the marker density. The present paper adopts a fully Bayesian methodology in estimation; this has advantages in terms of fully modelling the sources of uncertainty in estimation and out-of-sample prediction of risk.
In the present analysis, the major dimensions of risk combine the impacts of subsets of available indicators and are summarized as normally distributed scores with which multivariate classifications are then made. The study concerns risk factors for very low birth-weight, under 1500 g (VLBW), in a health region covering NW London and parts of the adjacent South East of England. Following previous research, major groups of risk factors for low birth-weight are socio-demographic variables (e.g. maternal age and ethnicity), maternal health behaviour and medical history, and fertility and obstetric history (e.g. parity, previous low birth-weight). These are variables that can be observed at the mother's first antenatal booking with a view to allocating extra community resources or hospital tests. Subsequent quality of care in the pregnancy is likely to affect the outcome (Kotelchuck, 1994), but is, to a considerable degree, a 'developing variable'.
Birth-weight is a strong influence on infant mortality as well as later childhood morbidity and adult health. While the general trend is to higher birth-weight, very low birth-weights are also increasing (Power, 1994). In the present study, risk scores are developed from very low birth-weights over the period 1991-93. Predictive accuracy is assessed for subsequent VLBW births in 1994.
2 Screening with multiple risk factors
Studies of risk in individual maternities show the strong predictive value that maternal health history and health behaviour, previous fertility events, and maternal physiology have for adverse birth outcomes. For example, for low birth-weight, a range of studies have shown maternal smoking, low maternal height, and teenage motherhood as risk factors. However, there are also clear socio-economic influences (e.g. social class and household income), and even after accounting for individual level socio-economic maternal background, the area of residence, especially in terms of its deprivation rating, has been found to have an independent influence (Newlands et al., 1992). Similarly, the work of Herrera et al. (1997) shows the predictive strength of psychosocial as distinct from bio-medical risk factors. In the present study, medical and physiological risk factors are supplemented by fertility history and socio-demographic background, including a measure of deprivation in the mother's small area of residence.
Systems aimed to produce a single maternal risk score (e.g. via logistic regression) have an advantage of adjusting for co-varying influences on the outcome: for example, the adverse effect of teenage motherhood is confounded with socioeconomic status (Geronimus, 1986). On the other hand, there are problems with subjectivity in model choice, even when a more formal approach to model selection is applied. The model selection associated with a particular data set generates an additional sampling variability when predicting risk in new maternities (Phillips et al., 1990, p. 1191). Additionally, sensitivity may be reduced in risk assessment in new maternities when a single risk score combining conceptually distinct dimensions of risk is relied upon.
An alternative approach to discriminating high-risk populations, which may have utility when risk is multidimensional, is to obtain a set of risk measures or summary dimensions of risk based on observed indicators, and to assess their interdependence via distributional modelling. Suppose q dimensions of risk are available and assumed to follow a multivariate normal distribution in each set of maternities, or possibly multivariate t if robustness to outliers is an issue. Suppose we wish only to discriminate between high and low risk pregnancies PI[sub h] or PI[sub 1] and that these differ in their averages on the risk factors (mu[sub h], mu[sub 1]) and also on the covariances (SIGMA[sub h], SIGMA[sub 1]) between factors. For maternity i, with risk measures X[sub i] = (X[sub i1], X[sub i2], ..., X[sub iq]), classification into the two populations PI[sub h] or PI[sub 1] can be made on the basis of the likelihoods f[sub h](X[sub i]) or f[sub 1](X[sub i]), specifically using the ratio f[sub h](X[sub i])/f[sub 1](X[sub i]). Sometimes prior probabilities of the events, r[sub 1] and r[sub h] (e.g. r[sub h] = 0.01 if 1% of maternities are associated with VLBW infants), are available. The rule is then to allocate according to the ratio of r[sub h]f[sub h](X[sub i]) to r[sub 1]f[sub 1](X[sub i]). Alternatively, if the parameters of f[sub h](X[sub i]) and f[sub 1](X[sub i]) are estimated from N[sub h] cases and N[sub 1] non-cases, then the 'prior probabilities' are often based on these numbers, especially if there is differential sampling from affected and unaffected populations.
For the two-population allocation problem, we may consider a range of cut-off thresholds in the ratios r[sub h]f[sub h](X[sub i])/r[sub 1]f[sub 1](X[sub i]) or N[sub h]f[sub h](X[sub i])/N[sub 1]f[sub 1](X[sub i]), and assess how the probabilities of misclassification are affected as the threshold changes (Phillips et al., 1990). Let P[sub h] = SIGMA[sub h][sup -1] and P[sub 1] = SIGMA[sub 1][sup -1] denote the inverse dispersion, or precision, matrices in the two populations. If the two multivariate normal densities are assumed to have different means and covariances, then the log of the likelihood ratio (LLR) has the quadratic form
LLR(X[sub i]) = (1/2) log(|P[sub h]|/|P[sub 1]|) - (1/2)(X[sub i] - mu[sub h])'P[sub h](X[sub i] - mu[sub h]) + (1/2)(X[sub i] - mu[sub 1])'P[sub 1](X[sub i] - mu[sub 1]) (1)
Including the prior odds ratio in the calculation simply shifts the LLR by a fixed amount. If the dispersion matrices are equal SIGMA[sub h] = SIGMA[sub 1] = SIGMA, then the LLR is linear.
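The quadratic LLR rule in equation (1) can be sketched numerically; the following Python fragment is illustrative only (it is not the code used in the study), and all population parameters shown are hypothetical.

```python
import numpy as np
from scipy.stats import multivariate_normal

def llr(x, mu_h, sigma_h, mu_l, sigma_l):
    """Log-likelihood ratio log f_h(x) - log f_l(x) for two
    multivariate normal populations; this is the quadratic form
    that arises when the dispersion matrices differ."""
    return (multivariate_normal.logpdf(x, mu_h, sigma_h)
            - multivariate_normal.logpdf(x, mu_l, sigma_l))

# Hypothetical parameters for q = 3 risk dimensions
mu_h = np.array([0.5, 0.5, 0.5])      # high-risk means
mu_l = np.array([-0.5, -0.5, -0.5])   # low-risk means
sigma_h = 2.0 * np.eye(3)             # high-risk maternities more variable
sigma_l = np.eye(3)

x = np.array([0.6, 0.4, 0.7])
# Allocate to the high-risk population if the LLR exceeds a chosen
# threshold; including prior odds shifts the threshold by a constant
is_high = llr(x, mu_h, sigma_h, mu_l, sigma_l) > 0.0
```

With equal dispersion matrices the quadratic terms collapse and the rule becomes linear in x, as noted in the text.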
Sensitivity and specificity (i.e. the probabilities that high-risk and low-risk mothers are correctly classified) are then assessed within the estimation sample, and a certain threshold may be selected. The prediction problem is then to apply this rule to new (out-of-sample) maternities and examine the extent of attenuation in sensitivity and specificity.
3 Methodologies to identify risk factors
The risk factors X are generally latent constructs, not directly observable. We have to measure them indirectly using p observed proxy indicators Y. Often, we are faced with observed indicators of risk that are discrete (e.g. ordered or binary outcomes) as well as continuous indicators. One methodology that may be used here involves structural equation modelling (SEM), with a measurement model relating the Y[sub j] (j = 1, ..., p) to the X[sub k] (k = 1, ..., q). In the structural or causal model, the X[sub k] are then used to predict the outcome of interest, Z, which is often binary. Here, the Z[sub i] is very low birth-weight (binary), the X[sub ik] are postulated dimensions of risk and the Y[sub ij] are risk indicators observed early in pregnancy.
Typically, the measurement model in an SEM involves assumptions about which factors are relevant to 'predicting' the observed indicators. Thus, if Y[sub i1] denotes maternal marital status, then a non-zero loading may be postulated relating it to the socio-demographic risk factor X[sub i1], but a zero loading relating it to the adverse history factor X[sub i2]. The analysis would confirm or negate alternative loading patterns depending on comparative goodness of fit. This would involve regressions with a mixture of links from Y to X, e.g. logit if Y is binary, cumulative logit if Y is ordinal but discrete, and linear if Y is metric (e.g. maternal weight) (Muthen, 1984). There may then be issues around the best choice of link, whether the continuous indicators Y should be categorized, and so on. An SEM analysis with categorical and/or ordinal indicators Y would be more complex to estimate than one with only metric indicators, but epidemiological methods are often based on categorization of risk factors (Woodward, 1999), as categorization assists in the analysis of gradients of health risk (e.g. by social class) and in detecting possible non-linearities.
An alternative strategy, where the risk indicators include a mixture of continuous and discrete variables, is to derive the latent scores X[sub ik] as the thresholds underlying probit or logit regressions of Z[sub i] on the Y[sub ij]. With probit regression, we derive normally distributed risk scores X[sub i1], ..., X[sub iq] for each maternity, whereas logit regression corresponds to a heavier tailed density for the risk scores (Albert & Chib, 1993). The approach adopted here continues epidemiological methods that use multivariate functions (i.e. discriminant functions combining the impact of many predictors) in stratifying or discriminating populations according to risk (Kahn & Sempos, 1989), but differs in defining subsets of the indicator variables according to the overall type of risk. It may have advantages over methods such as SEM in the ease with which prior knowledge of the impact of risk indicators (e.g. mother's ethnicity) on the outcome (birth-weight) may be incorporated, and in the use of categorical indicators.
Here we postulate q = 3 risk factors, and their associated indicators, with p = p[sub 1] + p[sub 2] + p[sub 3]. For example, the p[sub 1] = 4 indicators of the socio-demographic risk factor X[sub 1] are taken as:
(1) age of mother (continuous but converted to a categorical factor with three levels),
(2) ethnicity (a categorical factor with four levels: white, black African and Caribbean, South Asian, and other),
(3) a deprivation score in ward of residence (originally continuous but converted to a risk factor with six levels), and
(4) marital status (a binary variable for all unmarried versus married).
Using probit regression with VLBW as the binary outcome Z[sub i], a single normally distributed score X[sub i1] is developed to summarize the impact of these indicators on risk in the estimation sample, and the posterior means of the risk score X[sub i1] for each new 1994 maternity are derived using their values on the p[sub 1] = 4 indicators. Sets of p[sub 2] and p[sub 3] indicators are used for developing the factors X[sub 2] (medical history and health behaviour) and X[sub 3] (obstetric history), to give three sets of normally distributed risk scores. These are then treated as potentially interdependent factors in a multinormal classification model: testing for correlation between factors or equality of dispersion matrices (SIGMA[sub h], SIGMA[sub 1]) are then aspects of model assessment.
4 Application: thresholds from probit regression
All singleton very low birth-weight babies and 1% of remaining singleton births over 1991-93 form the estimation sample. The analysis is confined to mothers with a previous birth so that fertility history is clearly defined. It consists of 184 VLBWs and 337 controls with the full range of risk factors observed. The controls are defined to have typical birth-weights, namely within one standard deviation of the average for all infants; the means and standard deviations are specific to the infant's sex. The validation sample consists of the comparable population for the year 1994; it contains 52 VLBW births and 90 controls.
In general, a binary response model for outcome Z[sub i] assumes that the 'success' probability is pi[sub i] = F(betaY[sub i]), where F(.) is a distribution function and so lies between 0 and 1. Underlying the differences in the chance of a VLBW outcome, a continuous latent variable X[sub i] is posited such that Z[sub i] = 1 if X[sub i] is positive, and Z[sub i] = 0 if X[sub i] is negative. Thus, suppose
X[sub i] = betaY[sub i] + u[sub i]
where the u[sub i] are independent and identically distributed according to the chosen distribution function F. Then a success occurs according to
Pr(Z[sub i] = 1) = Pr(X[sub i]>0) = 1 - F(- betaY[sub i])
For forms of F that are symmetric about zero, this expression restates the equality pi[sub i] = F(betaY[sub i]), and so is equivalent to the usual probit or logit regression formulation.
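This equivalence is easy to check numerically; a minimal sketch with an arbitrary (hypothetical) value of the linear predictor:

```python
from scipy.stats import norm

# For symmetric F (here the standard normal), the latent-variable
# success probability 1 - F(-beta*Y) equals the usual probit form
# pi = F(beta*Y).  The linear predictor value below is arbitrary.
eta = 0.5                        # hypothetical beta*Y
p_latent = 1.0 - norm.cdf(-eta)  # Pr(X > 0) with X ~ N(eta, 1)
p_probit = norm.cdf(eta)         # standard probit formulation
```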
If F is the cumulative Normal, then sampling of X may be based on draws from a truncated normal: truncation is to the right (i.e. zero is the ceiling value) if Z[sub i] = 0, and to the left by zero if Z[sub i] = 1 (Albert & Chib, 1993). In the WINBUGS program used here, this involves setting different sampling limits according to whether the observed outcome Z[sub i] is 1 or 0. Thus, truncated normal sampling of the risk factors X from the interval I(L, H) would involve normal sampling as follows
X[sub i] ~ N(betaY[sub i], 1) I(L[Z[sub i] + 1], H[Z[sub i] + 1]) (2)
Typical sampling bounds might be L[1] = -20 and H[2] = 20, with L[2] = 0 and H[1] = 0 automatically.
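The truncated normal draw conditional on the observed outcome can be sketched in Python using SciPy (a sketch, not the WINBUGS code used in the study; the mean value is hypothetical):

```python
import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(42)

def sample_latent(mu, z, lo=-20.0, hi=20.0):
    """Albert-Chib-style draw of the latent X given outcome z:
    truncated to (lo, 0) when z = 0 and to (0, hi) when z = 1,
    matching bounds L[1] = -20, H[1] = 0, L[2] = 0, H[2] = 20."""
    a, b = ((0.0, hi) if z == 1 else (lo, 0.0))
    # truncnorm expects bounds standardized as (bound - mu) / scale
    return truncnorm.rvs(a - mu, b - mu, loc=mu, scale=1.0,
                         random_state=rng)

x0 = sample_latent(mu=0.3, z=0)   # negative draw, consistent with Z = 0
x1 = sample_latent(mu=0.3, z=1)   # positive draw, consistent with Z = 1
```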
To approximate a logit link, the X[sub i] can be sampled from a t density with 8 degrees of freedom, using the result that a t[sub 8] variable is approximately 0.634 times a logistic variable (Albert & Chib, 1993). This can be done by direct t-sampling or by retaining the normal sampling and introducing additional latent scale mixture variables lambda[sub i], such that X[sub i] ~ N(betaY[sub i],lambda[sub i][sup -1]), with lambda[sub i] sampled from a Gamma density G(4, 4).
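The scale-mixture route to the t[sub 8] approximation can be simulated directly; the sketch below (illustrative, with hypothetical seed and sample size) checks that mixing N(0, 1/lambda) over lambda ~ G(4, 4) reproduces the t[sub 8] variance of 8/6:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# lambda_i ~ Gamma(4, 4) (shape 4, rate 4, so mean 1), then
# X_i | lambda_i ~ N(0, 1/lambda_i) is marginally t with 8 df,
# since a t_nu variable arises from lambda ~ Gamma(nu/2, nu/2).
lam = rng.gamma(shape=4.0, scale=1.0 / 4.0, size=n)  # rate 4 = scale 1/4
x = rng.normal(0.0, np.sqrt(1.0 / lam))

var_emp = x.var()   # should be near nu/(nu - 2) = 8/6
```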
A useful diagnostic feature resulting from this approach is that the residuals
X[sub i] - betaY[sub i]
are nominally a random sample from the distribution F (Johnson & Albert, 1999), so a Bayesian approach offers a method for detecting outliers, and hence the need for the more robust logit link.
The cut-off point to distinguish cases from non-cases is X[sub i] = 0, corresponding to an estimated pi[sub i] = 0.5 and the same cut-off may be applied in predicting risk in the new maternities. However, there may be a case for testing different cut-offs to assess changes in the resulting sensitivity-specificity balance. The risk scores X[sub ik.new] have means mu[sub ik.new] = betaY[sub i.new] where Y[sub i.new] are the indicator variables used in 1991-93 but updated to the new 1994 maternities.
Estimation is via Bayesian methods, which have the advantage of (a) facilitating the incorporation of existing accumulated knowledge independent of the present study and (b) allowing the joint impact of several sources of uncertainty to be modelled. Estimation is via iterative simulation using Markov chain Monte Carlo (MCMC) methods, in particular the Gibbs sampling routines incorporated in the BUGS package (Bayesian inference Using Gibbs Sampling) of Spiegelhalter et al. (1996). The results of Bayesian estimation are presented as posterior summaries. These summaries include the posterior mean of each parameter from the sampling output and a credible interval with 95% empirical probability of containing the parameter value. Analyses are based on a single run of 10 000 iterations, with estimates of the posterior means of mu[sub i] and mu.new[sub i] based on iterations 10 000-11 000. Autocorrelation plots for the parameters indicate a fast decline in correlation at lags 1, 5 and 10, except for the age-of-mother regression coefficients. However, subsampling every tenth iteration of the chains for the latter made little difference to the posterior estimates.
5 Risk factor regression results
For the probit analysis to develop risk scores, minimally informative priors (those consistent with prior ignorance) are assumed for the regression coefficients. Specifically, it is assumed that beta[sub j] ~ N(0, 1000), where j ranges over the predictors included in each risk dimension model (2) above. This may be rather conservative, as there is extensive prior knowledge about the risk factors for low birth-weight. For the categorical predictor variables, the first category is taken as a reference category having zero effect. Thus, if predictor k has C categories then beta[sub k1] = 0, and
beta[sub km] ~ N(0, 1000) m = 2, ..., C.
We then examine the centred coefficients b[sub km] = beta[sub km] - beta-bar[sub k], where beta-bar[sub k] is the mean of the beta[sub km] over the C categories.
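The centring operation can be sketched as follows; the corner-constrained coefficient values here are hypothetical, not taken from the study:

```python
import numpy as np

# Corner-constrained coefficients for a C-category predictor
# (first category fixed at zero); values are hypothetical
beta = np.array([0.0, 0.7, 0.3, 0.2])

# Centred effects b_km = beta_km - mean(beta_k) sum to zero, so each
# category's effect is expressed relative to the category average
b = beta - beta.mean()
```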
The probit regression parameters relating Z[sub i] to the p[sub 1] = 4 socio-demographic indicators Y[sub i] = {Y[sub i1], ..., Y[sub i4]} are presented in Table 1. The probit slope parameters determine change on the standard normal scale. For example, a total X[sub i] = betaY[sub i] + u[sub i] of - 3 leads to a value PHI(-3) = 0.001, while X[sub i] = 1 gives PHI(1) = 0.84. The original coefficients had white, affluent, married women aged 15-19 as a reference category, but those in Table 1 have been centred. The reference category for the small area deprivation effect is affluent wards with Carstairs scores below - 3.
Among the indicators, the highest positive coefficients, and hence the highest risks of VLBW are for Afro-Caribbean mothers, for mothers over 30, and residence in a highly deprived ward. The contrast between Afro-Caribbean women and white women is clearly significant, and that between an affluent and deprived small area of residence is weighted towards positive values (the 95% credible interval is from -0.3 to 1.4). A positive impact of area deprivation on low birth weight has also been found by Reading et al. (1993), while Jarvelin et al. (1997) obtain similar findings with a locality wealth measure.
The posterior densities of the coefficients for older mothers, and for unmarried mothers (lone parents and those in cohabiting partnerships), are skewed towards positive values. Sanjose & Roman (1991) identify single mothers as having a high risk of small-for-gestational-age babies. As compared with married women, single women may be less likely to use health services, more likely to smoke, and have lower incomes and/or less social support during pregnancy. Older mothers are at risk of a number of adverse birth outcomes, including low birth-weight and stillbirths. These two effects, as in Table 1, are in conventional terms not 'significant', since both their 95% and 90% credible intervals include negative values. However, the weight of evidence is towards their having a positive effect, in the sense of enhancing the risk of very low birth-weight. The accumulated research evidence might well have justified informative priors that would have produced 'significant' effects for these variables. It may be noted that the credible interval for the contrast in risk between older mothers and mothers aged 20-34 is entirely focused on positive values.
Adding together the coefficients for being married, white, living in an affluent ward and aged 20-34 gives a risk of a very low birth-weight of 0.09, or 9%. This is obtained as PHI(-1.35), where -1.35 = -1.05 - 0.30 and -0.30 is the uncentred coefficient for ages 20-34. In contrast, a black, unmarried mother from a deprived ward has a risk of having a very low birth-weight child of 52%. Note that these risks should be set against the disproportionate weighting of the estimation sample (184 VLBWs, 337 controls) as compared with the typical population levels of very low birth-weight.
The main goal of the current analysis is to derive scores that can be combined into an overall classification rule using multidimensional methods. However, it is of interest to consider the prediction of risk in the new maternities using each score X[sub i1.new], X[sub i2.new], and X[sub i3.new] alone. For illustration, the critical values
C[sub i] = X[sub i1.new] + 0.5 (3)
in the new socio-economic risk score are used to identify high risk in 1994 maternities. The proportion of very low birth-weight children detected (sensitivity) averages around 56%, whereas these children account for 36% of the new sample. Specificity is around 49%. The 95% interval for the sensitivity ranges from 42% to 71%. These detection rates should be assessed in light of the fact that they use only risk factors known at booking (i.e. registering with antenatal services, usually relatively early in a pregnancy). Higher sensitivity from this and subsequent scores would be achieved using the maternal history known later in the pregnancy or at the start of labour, but this may confound outcome and risk by using 'developing' variables as indicators (Alexander & Keirse, 1989).
The medical history and health behaviour variables are more clearly indicated as 'significant' risk factors in the present setting (see Table 2) and accordingly have higher predictive power for new maternities. Using the above break-point, the average sensitivity of prediction for the 142 new births with this set of variables (46%) is similar to that for the socio-economic variables, but specificity is higher at 73%. The highest risk attaches to a history of hypertension and to late antenatal booking. The latter variable may, to some degree, indicate quality of medical care as well as psychosocial factors; its relationship to birth-weight has been demonstrated by Florey & Taylor (1994), who found that care begun in the first trimester was associated with a 140 g greater birth-weight. In particular, variables such as delay in antenatal booking and, more generally, the adequacy of prenatal visits may be seen as a proxy for the bio-psychosocial risks of low birth-weight, as argued in a Californian study of low birth-weight (Stinson et al., 2000).
A history of chronic hypertension (a condition distinct from pregnancy-induced or gestational hypertension) is a predisposing factor for pre-eclampsia, although the majority of chronically hypertensive women do have a normal perinatal outcome (Enkin et al., 1992). There is a 'dose-response' relation observed in the effect of smoking, consistent with evidence that smoking has an adverse impact on foetal growth (birth-weight for gestational age), this impact being independent of any interrelation between social class and smoking (Peacock et al., 1995).
The estimated contrast in risk between no smoking and moderate or heavy smoking is clearly significant. A non-smoker in the baseline categories on the other medical/health variables (i.e. having a BMI under 30, a gestation at booking under 20 weeks, no diabetic history, no hypertension history, and weight over 50 kg) has a risk of 16.8% of a very low birth-weight child. In contrast, a moderate/heavy smoker also in these baseline categories has a risk of PHI(-0.35) = 0.36, or 36% (-0.35 is the sum of -0.96, the intercept, and 0.61, the moderate/heavy smoking coefficient).
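These tail probabilities follow directly from the standard normal CDF; note that the smoker's linear predictor is the intercept plus the smoking coefficient, -0.96 + 0.61 = -0.35. A quick check:

```python
from scipy.stats import norm

# Non-smoker in all baseline categories: risk is PHI(intercept)
p_base = norm.cdf(-0.96)           # about 0.168, i.e. 16.8%

# Moderate/heavy smoker, otherwise baseline: intercept + smoking effect
p_smoker = norm.cdf(-0.96 + 0.61)  # PHI(-0.35), about 0.36
```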
Of the history variables, all three predictors, namely previous LBW occurrence, previous infant death and high parity, emerge clearly as high-risk factors (see Table 3). Similarly, Dowding (1981) identified the incidence of low and suboptimal birth-weight as highest in fifth and subsequent births, while Sanjose & Roman (1991) identified previous perinatal death as a strong predictor of low birth-weight. Using the above break-point (3), the average sensitivity of prediction for the 142 new births is 58% and specificity is 61%.
6 Multidimensional screening using the risk constructs
The posterior means of the scores of each mother on the three postulated risk constructs X[sub i1],X[sub i2],X[sub i3] are derived in both the sample data (1991-93) and new data (for 1994). They may be used in a stochastic version of multinormal discriminant analysis (Mardia et al., 1977, chapter 11). As argued by Lavine & West (1992), Bayesian analyses of normal mixture models for classification and discrimination allow the computation of exact posterior classification probabilities for observed data and for future cases.
Assume we simply want to distinguish high from normal risk. Then the high-risk threshold points for both estimation and validation samples are subject to a trade-off between positive predictive values, sensitivity and specificity. This trade-off can be identified most completely via ROC curves, although they may be complicated by differential costs. Thus, in medical treatment terms, false negatives may be ultimately more costly than false positives (e.g. in terms of costs for special care of infants), and so a full analysis would also involve a loss function and some knowledge of the costs of possible further tests (Geisser, 1993). Here, we confine our analysis of threshold choice to the impact of a small set of options for identifying high risk.
As above, we initially consider a multinormal model for the two populations with unequal dispersion matrices. Suppose the multivariate normal densities on q variables are denoted f[sub h](x) and f[sub 1](x). Here q = 3, so f[sub h] and f[sub 1] may be denoted N[sub 3](mu[sub h], P[sub h]) and N[sub 3](mu[sub 1], P[sub 1]). The means mu[sub h] and mu[sub 1] are three-dimensional vectors, themselves distributed according to a non-informative hyper-prior N[sub 3](0, G[sup -1]), where the off-diagonals of the precision matrix G are zero and the diagonal terms are 0.001. The precision matrices P[sub h] and P[sub 1] are assumed to follow Wishart priors W(R, 3), with scale matrix R assigned values R[sub 11] = R[sub 22] = R[sub 33] = 0.01 and off-diagonals set to 0.005. This matrix is the prior guess at the order of magnitude of the covariance matrices SIGMA[sub j] = P[sub j][sup -1] (j = h, 1).
Using the posterior estimates of SIGMA[sub j], we derive the correlations between the three sets of scores in the two populations (Table 4). These show modest associations, among the unaffected cases, between history and social background, and between history and medical risk. Among the affected there is a small negative correlation between social and biomedical risk. However, there are clear differences in the variability of the dimensions between the VLBW and other maternities, with the VLBW maternities showing higher variability on all three.
To predict VLBW using all three scores jointly, we initially adopt a break-point in the log-likelihood ratio of 1, approximately the third quartile in these ratios in the 521 estimation sample maternities. The sensitivity among the sample cases is 53% with a credible interval from 47 to 57%, and the average specificity is 90%. The actual relative frequency of very low birth-weight maternities in the estimation sample is 35%. Among the new maternities, sensitivity is 43% and specificity 86%. This is compared with the actual relative frequency in the new set of 36%. Assuming false negatives are likely to be ultimately more costly than false positives, we might lower the risk threshold in the predictive situation, say to one less than that used in the estimation sample. On this basis, the sensitivity in the 1994 maternities reaches 49%, and specificity is 81%.
The choice of optimal screening procedure partly depends on the relative costs of a false positive and a false negative.[1] As above, sensitivity is denoted S' and specificity F', so the false positive rate is F = 1 - F' and the false negative rate is S = 1 - S'. If these two outcomes are assumed equally costly, Youden's index 1 - S - F may be used to discriminate between procedures. As argued above, there may be reasons to doubt the assumption of equal costs, so the Youden index is best viewed as a simple way of combining the sensitivity and specificity (an ROC curve shows the same information graphically). Under this assumption, the index ranges from zero, implying the test has no diagnostic value, to 1 (when the test is invariably correct).
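Youden's index as defined above is simply sensitivity + specificity - 1; a trivial sketch using the sensitivity (62%) and specificity (68%) quoted for the 1994 maternities with the LLR cut at -1:

```python
def youden(sensitivity, specificity):
    """Youden's index 1 - S - F = sensitivity + specificity - 1;
    0 means no diagnostic value, 1 an invariably correct test."""
    return sensitivity + specificity - 1.0

# Sensitivity 62% and specificity 68% (1994 maternities, LLR cut at -1)
j = youden(0.62, 0.68)   # about 0.30
```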
Accordingly, we vary the threshold in the LLR and see how the rates of sensitivity S' and specificity F' vary. Table 5 shows specificities and sensitivities in the estimation sample for three values of the LLR, namely 0, 0.5 and 1. The threshold for prediction is set one lower (i.e. at -1, -0.5 and 0). The sensitivity among the new maternities reaches 62% with the LLR set at -1, combined with a specificity of 68%, giving a Youden value of 0.30. The positive predictive values (proportions of true positives among apparent positives) fall, however, as the LLR threshold is reduced. The corresponding Youden values in Tables 1 to 3 (using each of the new scores on X[sub i1], X[sub i2] and X[sub i3] singly) are respectively 0.05, 0.19 and 0.19.
7 Discussion
The approach outlined aims to preserve the multidimensional nature of the risk affecting adverse birth outcomes. This multidimensionality is expected on substantive grounds for most health outcomes, as formally argued by the biopsychosocial model (Herrera et al., 1997; Stinson et al., 2000). The approach implemented here gives improved predictive accuracy within and out of sample as compared with that obtained by screening separately using scores representing history, medical background, health behaviour and socio-economic factors. For example, the threshold of zero used to predict VLBW among new maternities yields a Youden index of 0.29, compared with a maximum of 0.19 using either X[sub i1], X[sub i2] or X[sub i3] alone.
The sensitivity analysis of this approach may take several forms. One may modify the minimally informative priors in some or all of the initial three probit analyses. For example, the effects of the socio-economic variables on very low birth-weight have some uncertainty in Table 1, yet there is evidence from other studies of clear deprivation and ethnicity effects. One might, for example, assume South Asian and black ethnicity to be associated with an enhanced risk of low birth-weight. One might also assume a priori that there is a monotonic 'dose-response' effect of area deprivation or smoking level on low birth-weight. This would imply a constrained prior that can be implemented in a Gibbs sampling or other MCMC framework (Spiegelhalter et al., 1996).
A second option is to assume a robust model for deriving the latent risk scores, for example via a t density for X[sub i] in (2) (Albert & Chib, 1993). The multiple discrimination stage may also be made robust with a scale mixture. In the present analysis, a test of the latter option involved applying a multivariate t density (with 5 degrees of freedom), rather than a multivariate normal, to both populations. The LLR threshold was set at 0 (and - 1 in the new maternities). This yields virtually identical performance to the multivariate normal. If, however, we apply the multivariate t only to affected cases, then specificity and sensitivity in the new sample both improve slightly. The Youden index for predicting high risk in the 1994 maternities averages 0.305 rather than 0.300.
A further extension would involve model averaging. This would be relevant in the initial probit regression stages when there is model uncertainty, and several apparently disparate models (in substantive terms) yield similar measures of fit. It also has a particular role in the multinormal discrimination stage, in terms of the choice between different or equal variance-covariance matrices in the two populations (Smith & Spiegelhalter, 1981). These authors develop a Bayes factor expression on model 0 (equal variance-covariance matrices) versus model 1 (different variance-covariances).
Here, we adopt a discrete prior approach to test the unequal versus equal dispersion matrices hypotheses. The two hypotheses are assigned equal prior probability. Under the first there is one dispersion matrix applicable to both populations of maternities, while under the other there are distinct dispersion matrices. The relative proportions of selection between the two hypotheses in a long MCMC run show which hypothesis is supported by the data. A single run of 10000 showed no support for a common dispersion matrix between the two populations.[2]
For q = 2 markers, the analytic expression of Smith & Spiegelhalter (1981) for the Bayes factor reduces to
Multiple line equation(s) cannot be represented in ASCII text
where GAMMA is the gamma function. As a sensitivity test, this was applied in an analysis of the present data but omitting the fertility history scores, namely to the bivariate scores {X[sub i1],X[sub i2]}. This again showed little support for the equality of variance-covariance matrices, with the Bayes factor essentially estimated as zero.
For policy and cost evaluation purposes, one may also incorporate 'cost effectiveness' into the discrimination problem (Geisser, 1993). This can be fitted into a Bayesian analysis by monitoring and comparing posterior expected losses when the mother is actually a high-risk or low-risk mother, but is misclassified.
The current methodology based on sampling latent risks X from probit or logit regression of Z on Y does not provide a unique answer to the classification problem addressed here. However, other methodologies, such as structural equation modelling, may well be more complex in terms of dealing with links between X and Y, and in assessing the performance of the classification instrument. The present methodology is also easily adapted to using informative priors on the effect of the indicators Y[sub i] on the outcome Z[sub i], which may be more complex to specify in terms of priors on loadings linking indicators to latent variables in the measurement sub-model of a structural equation approach. The present method would probably be improved in terms of accuracy (sensitivity, specificity) by adopting informative priors, e.g. monotonic dose-response impacts of maternal smoking or residential deprivation on VLBW.
Its performance would also apparently be improved by drawing on developing medical variables in pregnancy (e.g. complications as the date of delivery approached) but at the expense of possibly confounding the outcome and the risk scores. Using indicators defined at the start of pregnancy gives a more valid approach to assessing the bio-psychosocial model of maternity outcomes, and other multidimensional risk models for maternal and infant health outcomes.
Notes
1. The usual rule is to assign a maternity to the high-risk population PI[sub h] if
f[sub h](x)/f[sub 1](x) > r[sub 1]/r[sub h]
Additional information on the costs of misclassification may also be used when available: for example the cost of a false negative C(1|h), that is classifying a mother as low-risk when she is high-risk, may be more than the cost C(h|1) of a false positive. When costs are available the decision rule is to allocate to high-risk if
f[sub h](x)/f[sub 1](x) > (r[sub 1]C(h|1))/(r[sub h]C(1|h))
If the cost of a false negative is greater, then this alternative rule reduces the threshold for risk, raising the sensitivity but reducing specificity.
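The cost-weighted allocation rule in this note can be sketched as follows, working on the log scale; this is an illustrative fragment with hypothetical prior rates and costs, not values from the study.

```python
import math

def allocate_high_risk(log_f_h, log_f_l, r_h, r_l, c_fn, c_fp):
    """Allocate to the high-risk population when
    f_h(x)/f_1(x) > (r_1 * C(h|1)) / (r_h * C(1|h)),
    where c_fn = C(1|h) is the cost of a false negative and
    c_fp = C(h|1) the cost of a false positive."""
    threshold = math.log((r_l * c_fp) / (r_h * c_fn))
    return (log_f_h - log_f_l) > threshold

# A tenfold costlier false negative lowers the threshold by log(10)
# relative to the prior-odds-only rule r_1 / r_h (hypothetical values)
flag = allocate_high_risk(3.0, 0.0, r_h=0.01, r_l=0.99,
                          c_fn=10.0, c_fp=1.0)
```

As the note observes, raising the false negative cost lowers the threshold, raising sensitivity at the expense of specificity.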
2. Suppose we denote the number of options (hypotheses) for the covariance matrix as H. The number of populations (i.e. VLBW versus normal births) is K = 2. We use a categorical prior to select among the covariance hypotheses. The relevant code in BUGS (which parameterizes the normal in terms of precisions rather than variances-covariances) is as follows:
# BUGS code for K = 2 populations, q markers, unequal and equal
# precision matrices (Th and Tc); T = Th or T = Tc according to
# whether kd = 1 or kd = 2 is selected
for (k in 1:K) {for (i in 1:N[k]) {
    # multivariate likelihoods for the two populations
    X[k, i, 1:q] ~ dmnorm(gamma[k, ], T[kd, k, 1:q, 1:q])}}
# prior on the covariance hypothesis
for (j in 1:H) {prior[j] <- 0.5}
kd ~ dcat(prior[1:H])
# priors for the precision matrices
for (k in 1:q) {Rc[k, k] <- 0.01
    for (l in (k + 1):q) {Rc[k, l] <- 0.005; Rc[l, k] <- 0.005}}
Tc[1:q, 1:q] ~ dwish(Rc[1:q, 1:q], q)
for (i in 1:K) {
    for (k in 1:q) {R[i, k, k] <- 0.01
        for (l in (k + 1):q) {R[i, k, l] <- 0.005; R[i, l, k] <- 0.005}}
    Th[i, 1:q, 1:q] ~ dwish(R[i, 1:q, 1:q], q)
    for (j in 1:q) {for (k in 1:q) {
        T[1, i, j, k] <- Th[i, j, k]
        T[2, i, j, k] <- Tc[j, k]}}}
Correspondence: Peter Congdon, Department of Geography, Queen Mary and Westfield, Mile End Road, London E1 4NS, UK and Department of Public Health, Barking and Havering Health Authority, The Clock House, East Street, Barking, Essex IG11 8EY, UK.
TABLE 1. Socio-economic risk factors for very low birth weight
                                          Mean   St devn   2.5%   97.5%
Out-of-sample performance
  Sensitivity                             0.56    0.08     0.42    0.71
  Specificity                             0.49    0.07     0.36    0.61
Estimation sample parameters
  Constant                               -1.05    0.66    -2.06    0.19
Ethnicity (centred effects)
  White                                  -0.29    0.15    -0.59    0.01
  Caribbean/African New Commonwealth      0.38    0.22    -0.06    0.81
  India/Pakistan                          0.02    0.20    -0.37    0.40
  Other                                  -0.10    0.33    -0.75    0.55
  Contrast: white vs Afro-Caribbean       0.67    0.27     0.14    1.21
Mother's age (centred effects)
  Under 20                               -0.16    0.46    -1.06    0.65
  20-34                                  -0.13    0.24    -0.58    0.33
  35 and over                             0.30    0.27    -0.21    0.82
  Contrast: 35+ vs 20-34                  0.43    0.22     0.00    0.86
Residence type (centred effects)
  Affluent wards[*]                      -0.29    0.24    -0.79    0.18
  Middling wards                          0.06    0.15    -0.23    0.36
  Deprived wards                          0.22    0.21    -0.18    0.63
  Contrast: deprived vs affluent wards    0.51    0.43    -0.32    1.37
Marital state
  Unmarried mother                        0.25    0.21    -0.18    0.66
* Affluent: Carstairs score under -3; Middling, Carstairs -3 to 6; Deprived, Carstairs over 6.
TABLE 2. Medical and behavioural risk factors for very low birth-weight[*]
                                          Mean     St devn   2.5%     97.5%
Out-of-sample performance
  Sensitivity                             0.46      0.07     0.31      0.60
  Specificity                             0.73      0.05     0.60      0.83
Estimation sample parameters
  Constant                               -0.959     0.099   -1.154    -0.769
  BMI over 30                            -0.023     0.134   -0.246     0.196
  Gestation at booking 20 weeks or over   0.685     0.118    0.490     0.874
  Diabetic history: yes                   0.559     0.532   -0.341     1.419
  Hypertension history: yes               0.671     0.138    0.445     0.898
Smoking
  Occasional                              0.237     0.137    0.009     0.459
  Moderate or heavy                       0.608     0.138    0.382     0.836
Centred smoking effects
  None                                   -0.265     0.108   -0.480    -0.054
  Occasional                             -0.203     0.160   -0.516     0.107
  Moderate or heavy                       0.468     0.138    0.201     0.740
  Contrast: moderate/heavy vs none        0.733     0.189    0.366     1.105
Maternal weight
  Under 50 kg                             0.460     0.136    0.240     0.686
* Effects are contrasted with BMI under 30, gestation at booking under 20 weeks, no diabetic history, no hypertension history, no smoking, weight over 50 kg.
TABLE 3. Previous fertility risk factors for very low birth-weight[*]
                                          Mean   St devn    2.5%   97.5%
Out-of-sample performance
  Sensitivity                             0.58     0.06     0.46    0.71
  Specificity                             0.61     0.07     0.48    0.74
Estimation sample parameters
  Constant                               -0.74     0.08    -0.90   -0.58
  Parity: 3 or more previous births       0.256    0.119    0.028   0.492
  Previous infant death: yes              0.626    0.307    0.061   1.254
  Previous LBW (< 2.5 kg): yes            0.920    0.116    0.696   1.147
* Effects are contrasted with maternities with under 3 previous births, no previous infant death, and no previous low birth-weight.
TABLE 4. Estimated dispersion and correlation matrices for two maternity populations
Correlations between dimensions[*]
Unaffected          1        2
  2             0.035       --
  3             0.163    0.165
Affected            1        2
  2            -0.097       --
  3            -0.006    0.065
Variance-covariance matrices
Unaffected          1        2        3
  1             0.028       --       --
  2             0.003    0.142       --
  3             0.007    0.016    0.060
Affected (VLBW births)
                    1        2        3
  1             0.053       --       --
  2            -0.012    0.245       --
  3            -0.003    0.016    0.325
* 1 = socio-demographic, 2 = health history and behaviour, 3 = fertility history.
TABLE 5. Sensitivity and specificity, estimation and new data, with varying LLR thresholds
Existing maternities (1991-93)
                       Mean   St devn    2.5%   97.5%
Threshold 0
  PPV                  0.68     0.01     0.66    0.71
  Sensitivity          0.61     0.02     0.57    0.64
  Specificity          0.84     0.01     0.82    0.87
  Youden               0.45     0.01     0.43    0.47
Threshold 0.5
  PPV                  0.73     0.01     0.70    0.75
  Sensitivity          0.57     0.02     0.53    0.60
  Specificity          0.88     0.01     0.86    0.90
  Youden               0.45     0.01     0.43    0.47
Threshold 1
  PPV                  0.74     0.01     0.73    0.76
  Sensitivity          0.53     0.02     0.47    0.57
  Specificity          0.90     0.01     0.89    0.91
  Youden               0.43     0.02     0.37    0.46
New maternities (1994)
Threshold -1
  PPV                  0.53     0.03     0.49    0.60
  Sensitivity          0.62     0.02     0.58    0.67
  Specificity          0.68     0.05     0.62    0.77
  Youden               0.30     0.04     0.25    0.38
Threshold -0.5
  PPV                  0.56     0.04     0.50    0.63
  Sensitivity          0.54     0.03     0.50    0.58
  Specificity          0.75     0.04     0.68    0.82
  Youden               0.29     0.04     0.21    0.35
Threshold 0
  PPV                  0.60     0.04     0.52    0.66
  Sensitivity          0.49     0.02     0.46    0.50
  Specificity          0.81     0.04     0.73    0.87
  Youden               0.29     0.03     0.23    0.33
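The screening summaries reported in Tables 1-5 (sensitivity, specificity, PPV and Youden's index) all derive from a 2x2 cross-classification of predicted high risk against the observed VLBW outcome. A minimal sketch of the standard definitions, with invented counts for illustration:

```python
# Hypothetical sketch of the screening summaries used in Tables 1-5.
# The counts (tp, fp, fn, tn) are invented, not taken from the paper.

def screening_summary(tp, fp, fn, tn):
    """Return sensitivity, specificity, PPV and Youden's index J."""
    sensitivity = tp / (tp + fn)          # true positives among affected
    specificity = tn / (tn + fp)          # true negatives among unaffected
    ppv = tp / (tp + fp)                  # positive predictive value
    youden = sensitivity + specificity - 1.0
    return sensitivity, specificity, ppv, youden

sens, spec, ppv, youden = screening_summary(tp=61, fp=16, fn=39, tn=84)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} "
      f"ppv={ppv:.2f} Youden={youden:.2f}")
```

With these invented counts, sensitivity and specificity match the point estimates for threshold 0 in the estimation sample, which shows how the Youden index of 0.45 arises.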
REFERENCES
ALBERT, J. & CHIB, S. (1993) Bayesian analysis of binary and polychotomous response data, J. Amer. Stat. Ass., 88, pp. 669-679.
ALEXANDER, S. & KEIRSE, M. (1989) Formal risk scoring during pregnancy. Chapter 22 in Effective Care in Pregnancy and Childbirth, Vol. I, M. ENKIN, M. KEIRSE & I. CHALMERS (Eds) (Oxford University Press).
CLARKE, M., MASON, E., MCVICAR, J. & CLAYTON, D. (1993) Evaluating perinatal mortality rates: effects of referral and case mix, British Medical J., 306, pp. 824-827.
COSTE, J., BOUYER, J. & JOB-SPIRA, N. (1997) Construction of composite scales for risk assessment in epidemiology: an application to ectopic pregnancy, Am. J. Epidemiology, pp. 278-289.
DOWDING, V. (1981) New assessment of the effects of birth order and socioeconomic status on birth-weight, British Medical Journal, 282, pp. 683-686.
ENKIN, M., KEIRSE, M. & CHALMERS, I. (1992) A Guide to Effective Care in Pregnancy and Childbirth (Oxford University Press).
FLOREY, C. & TAYLOR, D. (1994) The relation between antenatal care and birth-weight, Rev. Epid et Sante Publ., 42, pp. 191-197.
GEISSER, S. (1993) Predictive Inference: an Introduction (New York, Chapman and Hall).
GELMAN, A., CARLIN, J., STERN, H. & RUBIN, D. (1995) Bayesian Data Analysis (Chapman and Hall).
GERONIMUS, A. (1986) The effects of race, residence and prenatal care in the relationship of maternal age to neonatal mortality, American Journal of Public Health, 76, pp. 1416-1421.
HERRERA, J., SALMERON, B. & HURTADO, H. (1997) Prenatal biopsychosocial risk assessment and low birth-weight, Social Science in Medicine, 44(8), pp. 1107-1114.
JARVELIN, M., ELLIOTT, P., KLEINSCHMIDT, I., MARTUZZI, M., GRUNDY, C., HARTIKAINEN, A. & RANTAKALLIO, P. (1997) Ecological and individual predictors of birth-weight in a northern Finland birth cohort, 1986, Paediatric and Perinatal Epidemiology, 11, pp. 298-312.
KAHN, H. & SEMPOS, C. (1989) Statistical Methods in Epidemiology (Oxford University Press).
KIELY, J. (1991) Some conceptual problems in multivariable analyses of perinatal mortality, Paediatric and Perinatal Epidemiology, 5, pp. 243-257.
KOTELCHUCK, M. (1994) The adequacy of prenatal care utilization index: its US distribution and association with low birth-weight, American Journal of Public Health, 84, 1486-1489.
LAVINE, M. & WEST, M. (1992) A Bayesian method for classification and discrimination, Canadian Journal of Statistics, 20, pp. 451-461.
MUTHEN, B. (1984) A general structural equation model with dichotomous, ordered categorical, and continuous latent variable indicators, Psychometrika, 49, pp. 115-132.
NEWLANDS, M., ADAMSON, E., GHULAM, S., SALEH, M. & EMERY, J. L. (1992) Jarman index related to post-perinatal mortality, Public Health, 106(2), pp. 163-165.
PEACOCK, J., BLAND, J. & ANDERSON, H. (1995) Preterm delivery: effects of socioeconomic factors, psychological stress, smoking, alcohol and caffeine, British Medical Journal, 311, 531-535.
PHILLIPS, A., THOMPSON, S. & POCOCK, S. (1990) Prognostic scores for detecting a high-risk group: estimating the sensitivity when applied to new data, Statistics in Medicine, 9, pp. 1189-1198.
PHILOSOPHOV, L. & RYABININA, L. (1997) Medical diagnostic rules based on mutually dependent diagnostic factors, Computing in Medicine and Biology, 27, pp. 329-347.
POWER, C. (1994) National trends in birth-weight: implications for future adult disease, British Medical Journal, 308, pp. 1270-1271.
READING, R., RAYBOULD, S. & JARVIS, S. (1993) Deprivation, low birth-weight and children's height: a comparison between rural and urban areas, British Medical Journal, 307, pp. 1458-1462.
ROSS, M., CALVIN, J., BRAGONIER, J., BEAR, M. & BEMIS, R. (1986) A simplified risk-scoring system for prematurity, Am. J. Perinatology, 3(4), pp. 339-344.
ROYSTON, P. & THOMPSON, S. (1992) Model-based screening by risk with application to Downs Syndrome, Statistics in Medicine, 11, pp. 257-268.
SANJOSE, S. & ROMAN, E. (1991) Low birth-weights, preterm and small for gestational age babies in Scotland, 1981-1984, Journal of Epidemiology and Community Health, 45, pp. 207-210.
SCHAFER, J. (1996) Analysis of Incomplete Multivariate Data (London, Chapman and Hall).
SCHWARTZ, G. (1982) Testing the biopsychological model: the ultimate challenge facing behavioural medicine, Journal of Consulting and Clinical Psychology, 50(6), pp. 1040-1053.
SMITH, A. & SPIEGELHALTER, D. (1981) Bayesian approaches to multivariate structure. Chapter 17 in Interpreting Multivariate Data, V. BARNETT (ed) (Chichester, Wiley).
SPIEGELHALTER, D., THOMAS, A., BEST, N. & GILKS, W. (1996) BUGS: Bayesian Inference Using Gibbs Sampling, Version 0.50 (Cambridge, MRC Biostatistics Unit).
STINSON, J., LEE, K., HEILEMANN, M., GOSS, G. & KOSHAR, J. (2000) Comparing factors related to low birth-weight in rural Mexico-born and US-born Hispanic women in northern California, Family and Community Health, 23(1), pp. 29-39.
WALD, N. & CUCKLE, H. (1989) Reporting the assessment of screening and diagnostic-tests, British Journal of Obstetrics and Gynaecology, 96, pp. 389-396.
WOODWARD, M. (1999) Epidemiology (Chapman & Hall/CRC Statistics and Mathematics).
~~~~~~~~
By Peter Congdon, Department of Geography, Queen Mary and Westfield, London,
and Barking and Havering Health Authority, Essex, UK
Title: | Bayesian Methods (Book Review). |
Subject(s): | |
Source: | |
Author(s): | |
Abstract: | Reviews the book `Bayesian Methods,' by Thomas Leonard and John S.J. Hsu. |
AN: | 4047737 |
ISSN: | 0040-1706 |
Database: | Business Source Premier |
This book presents Bayesian inference and model building primarily using normalizing transformations for parameters that lead to approximately normal priors and likelihoods. Such transformations afford excellent posterior approximations yielding estimators with good frequentist performance. With this focus, the book differs from two other recent books on the application of Bayesian statistics, those by Carlin and Louis (1996) and Gelman, Carlin, Stern, and Rubin (1995). The latter emphasize noniterative and iterative Monte Carlo (MC) methods in the analysis of posteriors. But the difference is complementary. Users of MC methods have need of approximations like those here, making this a useful addition to any practitioner's library.
Bayesian Methods is pregnant with detailed examples, pulled primarily from recent literature, especially from contributions by the authors. Rather than serving simply as illustrations of results in the text, these examples are an integral part of the authors' development. What is more, they are interesting! They endow the theory with life and draw the reader deeper into the text.
The level of the book is consistent with its subtitle, An Analysis for Statisticians and Interdisciplinary Researchers. The reader should have a solid background in mathematical statistics (at the level of Casella and Berger 1990, for example) and at least some understanding of linear models and time series. In addition to familiarity with a comprehensive statistical package such as SAS or S-PLUS, facility with a good mathematical computing package such as Mathematica, Matlab, or Mathcad will be necessary to work through the exercises. The book is well suited as both a reference and for self-study. The index is thorough and the bibliography extensive, covering more than fifteen 10" x 7" pages.
Combining this book with Carlin and Louis (1996) and/or Gelman et al. (1995) would form the textual basis for a superb two-semester course on Bayesian methods. A one-semester course could be taught from the book, though some material would have to be deleted and more extensive coverage of MC methods would have to be provided by the instructor. The reader is encouraged to explore further afield because virtually all examples and many of the exercises are connected to the literature. For instance, one problem entitled "Smoothing a Logistic Regression Function (a semi-parametric procedure)" covers two pages and cites over a dozen related papers (pp. 249-250). In addition to the examples in the book, there are 49 "worked examples" and 148 "self-study exercises." The worked examples are presented as problems with detailed solutions.
The book contains six chapters. In the course of 74 pages, Chapter 1 ranges across the most important results from non-Bayesian statistical inference, with an emphasis on the likelihood, its properties, and its use. The reader's need for a strong mathematical statistics background is obvious from the start. On page 8 alone, for example, the authors define the likelihood, maximum likelihood estimators, entropy, the general information criterion, Akaike's information criterion, and the Bayesian information criterion. By page 14, there are exercises referring to unbiasedness (without explanation) and at least one uses the general linear model. Later in the chapter, the authors are careful to take more time with topics that may not have been covered in a typical mathematical statistics course, like histogram smoothing and the likelihood principle. Thus, the pace of this review chapter, though brisk, should not deter most Technometrics readers and well-prepared students.
The discrete version of the Bayes theorem is the topic of Chapter 2. Applications include model selection, logistic discrimination, and others. An interesting feature of this chapter is a two-page exposition on the use of the Bayes theorem for the evaluation of evidence, which the authors note is "[in] contrast with current standard practice in legal cases, for example, for genetic testing and DNA profiling, which gives rise for social concern at an international level" (p. 96).
The Bayes theorem is applied to models with single real-valued parameters in Chapter 3. The authors pay close attention to the elicitation of priors and model checking. The issue of vague prior information is extensively discussed, and here arises one of the few instances where the book puzzles me a little. Noting that a scientist is rarely in a state of complete ignorance, the authors state that "The Bayesian paradigm cannot formally handle complete prior ignorance" (p. 134). Next they imply, intentionally or otherwise, that non-Bayesian methods somehow can handle prior ignorance: "In such situations, the scientist can use likelihood methods if there is a sampling model available" (p. 134). They immediately proceed to focus on priors that handle little or vague information, but the "damage" has been done. I can well imagine a novice inferring from this that "conservative" or "objective" analyses demand non-Bayesian methods.
Both posterior and predictive inferential methods are extensively covered in Chapter 3. Results are obtained through conjugate analyses or approximations, often requiring normalizing transformations. Everything is effectively woven together using examples presented in great detail. Like Carlin and Louis (1996), the authors place a premium on the frequency properties of inferential methods. To that end, the chapter also considers loss functions, risk functions, Bayes risk, and related ideas.
The authors describe Chapter 4 as providing "a break to some of the technicalities" (p. xii). The break comes in the form of a 23-page treatment of the expected utility hypothesis! Although the material is well presented, it is essentially basic utility and decision theory and seems very out of place in this book. I would much prefer a chapter with more extensive coverage of MC methods or a lengthy appendix covering more of the computational details. I think Chapter 4 will be the least interesting to readers of Technometrics. Fortunately, the uninterested reader can skip this material with no loss in continuity.
Chapter 5 returns to the main course and tackles models with several parameters. The emphasis is on marginal densities in which calculations, if not analytical, are conducted using numerical integration methods, multivariate normal approximations, conditional maximization with respect to the nuisance parameters, and Laplacian approximations, all facilitated by parameter transformations. MC methods are sometimes employed, but their more detailed treatment is saved for Chapter 6. Examples include failure-time analyses, linear logistic models, prediction for regression, inference for the negative binomial, models with interaction for contingency tables, nonlinear regression, the Kalman filter, on-line analysis of time series, Bayesian forecasting in economics, and others.
Prior structures, posterior smoothing, Bayes-Stein estimation, and MC methods are the subjects of Chapter 6, the last chapter. Over half of the chapter is devoted to the detailing of transformations and approximations that form the unifying theme of the book. Multivariate normal priors for transformed parameters are given a lengthy treatment along with Laplacian approximations using posterior mode vectors. Several approaches to constructing a multivariate normal prior are discussed, including hierarchical Bayes, parametric empirical Bayes, and the marginal posterior mode compromise. Everything is tied together with practical examples. The treatment of iterative and noniterative MC methods, covering about 26 pages, is all too brief, but consistent with the theme of the book. Here one must turn, for example, to Carlin and Louis (1996), Gelman et al. (1995), Gilks, Richardson, and Spiegelhalter (1996), or Gamerman (1997).
I strongly recommend this book to anyone interested in Bayesian methods. I look forward to using it in the classroom. It has already become one of a half-dozen or so recent books on applications of the Bayesian paradigm to which I routinely refer. The only problem is, my students are constantly borrowing them!
Carlin, B. P., and Louis, T. A. (1996), Bayes and Empirical Bayes Methods for Data Analysis, New York: Chapman and Hall (CRC Press).
Casella, G., and Berger, R. L. (1990), Statistical Inference, Pacific Grove, CA: Wadsworth & Brooks-Cole.
Gamerman, D. (1997), Markov Chain Monte Carlo, New York: Chapman and Hall (CRC Press).
Gelman, A., Carlin, J., Stern, H., and Rubin, D. B. (1995), Bayesian Data Analysis, New York: Chapman and Hall (CRC Press).
Gilks, W., Richardson, S., and Spiegelhalter, D. (1996), Markov Chain Monte Carlo in Practice, London: Chapman and Hall (CRC Press).
~~~~~~~~
By John W. Seaman Jr., Baylor University
Title: | Time-Varying Network Tomography: Router Link Data. |
Subject(s): | |
Source: | |
Author(s): | |
Abstract: | The origin-destination (OD) traffic matrix of a computer network is useful for solving problems in design, routing, configuration debugging, monitoring, and pricing. Directly measuring this matrix is not usually feasible, but less informative link measurements are easy to obtain. This work studies the inference of OD byte counts from link byte counts measured at router interfaces under a fixed routing scheme. A basic model of the OD counts assumes that they are independent normal over OD pairs and iid over successive measurement periods. The normal means and variances are functionally related through a power law. We deal with the time-varying nature of the counts by fitting the basic iid model locally using a moving data window. Identifiability of the model is proved for router link data and maximum likelihood is used for parameter estimation. The OD counts are estimated by their conditional expectations given the link counts and estimated parameters. Thus, OD estimates are forced to be positive and to harmonize with the link count measurements and the routing scheme. Finally, maximum likelihood estimation is improved by using an adaptive prior. Proposed methods are applied to two simple networks at Lucent Technologies and found to perform well. Furthermore, the estimates are validated in a single-router network for which direct measurements of origin-destination counts are available through special software. [ABSTRACT FROM AUTHOR] |
AN: | 3851328 |
ISSN: | 0162-1459 |
Database: | Business Source Premier |
The origin-destination (OD) traffic matrix of a computer network is useful for solving problems in design, routing, configuration debugging, monitoring, and pricing. Directly measuring this matrix is not usually feasible, but less informative link measurements are easy to obtain.
This work studies the inference of OD byte counts from link byte counts measured at router interfaces under a fixed routing scheme. A basic model of the OD counts assumes that they are independent normal over OD pairs and iid over successive measurement periods. The normal means and variances are functionally related through a power law. We deal with the time-varying nature of the counts by fitting the basic iid model locally using a moving data window. Identifiability of the model is proved for router link data and maximum likelihood is used for parameter estimation. The OD counts are estimated by their conditional expectations given the link counts and estimated parameters. Thus, OD estimates are forced to be positive and to harmonize with the link count measurements and the routing scheme. Finally, maximum likelihood estimation is improved by using an adaptive prior.
Proposed methods are applied to two simple networks at Lucent Technologies and found to perform well. Furthermore, the estimates are validated in a single-router network for which direct measurements of origin-destination counts are available through special software.
KEY WORDS: Expectation-Maximization algorithm, Filtering, Normal, Inverse problem, Link data, Maximum likelihood estimation, Network traffic, Smoothing, Variance model.
Research on computer network monitoring and management is exploding. A statistical perspective is needed for solving many of these problems either because the desired measurements are not available directly or because one is concerned about trends that are buried in notoriously noisy data. The problem we consider in this article has both of these aspects: indirect measurements and a weak signal buried in high variability.
In a local area network (LAN), routers and switches direct traffic by forwarding data packets between nodes according to a routing scheme. Edge nodes connected directly to routers (or switches) are called origins or destinations, and they do not usually represent single users but rather groups of users or hosts that enter a router on a common interface. An edge node is usually both an origin and a destination depending on the direction of the traffic. The set of traffic between all pairs of origins and destinations is conventionally called a traffic matrix, but in this article we usually use the term origin-destination (OD) traffic counts to be specific. On a typical network, the traffic matrix is not readily available, but aggregated link traffic measurements are.
The problem of inferring the OD byte counts from aggregated byte counts measured on links is called network tomography by Vardi (1996). The similarity to conventional tomography lies in the fact that the observed link counts are linear transforms of unobserved OD counts with a known transform matrix determined by the routing scheme. Vardi (1996) studies the problem for a network with a general topology and uses an iid Poisson model for the OD traffic byte counts. He gives identifiability conditions under the Poisson model and discusses using the EM algorithm on link data to estimate Poisson parameters in both deterministic and Markov routing schemes. To mitigate the difficulty in implementing the EM algorithm under the Poisson model, he proposes a moment method for estimation and discusses the normal model as an approximation to the Poisson. Tebaldi and West (1998) follow up with a Bayesian perspective and an MCMC implementation but deal only with link counts from a single measurement interval. Vanderbei and Iannone (1994) apply the EM algorithm yet also use a single set of link counts.
This article focuses on time-varying network tomography. Based on link byte counts measured at the router interfaces and under a fixed routing scheme, the time-varying traffic matrix is estimated. The link counts are readily available through the Simple Network Management Protocol (SNMP), which is provided by nearly all commercial routers. The traffic matrix or OD counts, however, are not collected directly by most LANs. Such measurements typically require specialized router software and hardware dedicated to data collection. This is not practical for most networks. Nevertheless, network administrators often need to make decisions that depend on the traffic matrix. For example, if a certain link in the network is overloaded, one might either increase the capacity of that link or adjust the routing tables to make better use of the existing infrastructure. The traffic matrix can help determine which approach would be more effective and locate the source of unusually high traffic volumes. An Internet Service Provider can also use the traffic matrix to determine which clients are using their network heavily and charge them accordingly rather than using the common flat rate pricing scheme.
Two standard traffic measurements from SNMP are (1) incoming byte count, the number of bytes received by the router from each of the interfaces connected to network links; and (2) outgoing byte count, the number of bytes that the router sends on each of its link interfaces. We collect these incoming and outgoing link counts at regular 5-minute intervals from each of the routers in a local network at Lucent as diagrammed in Figure 1(a). The six boxes represent network routers. The oval is a switch with a similar function but no capabilities for data collection. Individual nodes hanging from the routers connect to subnetworks of users, other portions of the corporate network, shared backup systems, and so forth. A data packet originating, for example, from one of the Router1 nodes and destined for the public Internet would be routed from Router1 to the Switch, to Router4, through the Firewall to Gateway and then to the Internet. A network routing scheme determines this path.
Figure 2 presents link measurements on the subnetwork around Router1 as outlined in Figure 1(b). For several reasons, this article uses this simple network as the main example to illustrate the methodologies. One reason is that we have access to the system management staff for supplying data and for understanding details of the network; at the same time we have validation data available to check our estimates. Another reason for studying a small problem is that in larger networks, it often makes sense to group nodes and estimate traffic between node groups. In section 6, we discuss how estimation for all OD pairs in a large network can be done via a series of smaller problems on aggregated networks. We do not mean to imply, however, that all realistic problems can be reduced to our 4x4 example. Additional work is needed to scale up these methods to work on larger networks.
The 4 links on Router1 give rise to 8 link counts, with incoming and outgoing measurements for each interface. Time series plots of the link counts are shown in Figure 2. The byte counts are highly variable as is typical for data network traffic. The traffic levels rise and fall suddenly in ways that do not suggest stationarity. It is striking that a traffic pattern at an origin link interface can often be matched to a similar one at a destination interface, and this can give a pretty good idea of where traffic flows. For example, origin local matches destination fddi at hour 1 and origin switch matches destination corp at hour 4. Such pattern matching may seem undemanding, but the 4 x 4 = 16 OD count time series that we want to estimate are obviously not completely determined from the 8 observed link count series. There is often a large range of feasible OD estimates consistent with the observed link data. Our model-based estimates typically have errors that are much less than these feasible ranges. Section 5 gives a specific example. Furthermore, visual pattern matching does not work well for the two-router network discussed in section 6.
The rest of this article is organized as follows. Section 2 gives a basic independent normal model for OD byte counts in which means and variances are functionally related through a power law. The section also proposes to estimate these counts using approximate conditional expectations given the link measurements, the estimated parameters, and the fact that the OD counts are positive. An iterative refitting algorithm ensures that estimates meet the routing constraints. Section 3 deals with the time-varying nature of traffic data by using a moving data window to fit the basic model locally and presents exploratory plots to show that the model assumptions are plausible for our example data. Section 4 develops a refinement of the moving iid model by supplementing the moving window likelihood with an adaptive prior distribution that is derived from modeling the parameter changes using independent random walks. Estimates from this model are then validated in section 5 using direct measurements of OD byte counts obtained by running specialized software on the router and dumping data to a nearby workstation. Section 6 provides a brief application of our methods to a two-router network. Section 7 concludes with directions for further work. The appendix gives proofs of the model identifiability.
2.1 Basic Model
Here we describe a model for a single vector of link counts measured at a given time; the counts at time t reflect the network traffic accruing in the unit interval (e.g., a 5-minute interval) ending at t. Subsequently, in section 2.3, we discuss fitting this model to a sample of such link count vectors measured over successive time intervals.
Let x[sub t] = (x[sub t,1],..., x[sub t,I])' denote the vector of unobserved byte counts (average number of bytes per second, for example) for all OD pairs in the network at a given time t. In the Router1 network, x[sub t] has I = 16 elements. One element, for example, corresponds to the number of bytes originating from local and destined for switch. Although we have used the customary terminology "traffic matrix," it is convenient to arrange the quantities for OD pairs in a single vector, x[sub t].
Let y[sub t] = (y[sub t,1],..., y[sub t,J])' be the vector of observed incoming and outgoing byte counts at each router link interface. For example, one element of y[sub t] corresponds to the number of bytes originating from local regardless of their destination. With 4 nodes, there are 4 incoming links and 4 outgoing links. In principle, the router is neither a source nor a destination. This implies that the total count of incoming bytes should be equal to the total count of outgoing bytes and thus the 8 link counts are linearly dependent. To avoid redundancies we remove the outgoing link count for the 4th interface, leaving J = 7 linearly independent link measurements.
In general, each element of y[sub t] is a sum of certain elements of x[sub t] with the exact relationship determined by the routing scheme. In matrix notation, these relationships are expressed as
y[sub t] = Ax[sub t],
where A is a J x I incidence matrix embodying the routing scheme. Typically J is much less than I because there are many more node pairs than links. Each column of A corresponds to an OD pair and indicates which links are used to carry traffic between that pair of nodes. We assume A to be fixed and known.
In the case of the single-router network around Router1 as shown in Figure 1(b), the routing matrix has entries of 0 except where indicated:
A =
[ 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 ]   (incoming, interface 1)
[ 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 ]   (incoming, interface 2)
[ 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 ]   (incoming, interface 3)
[ 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 ]   (incoming, interface 4)
[ 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 ]   (outgoing, interface 1)
[ 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 ]   (outgoing, interface 2)
[ 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 ]   (outgoing, interface 3)
Columns are associated with OD pairs and are ordered in a manner consistent with x[sub t]. The second column, for example, is for origin 1 and destination 2. The first row of the equation y[sub t] = Ax[sub t] corresponds to the statement that the observed byte count for traffic arriving on interface 1 is equal to the sum of the unobserved counts for traffic originating at node 1 and destined for nodes 1, 2, 3, and 4.
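Because the routing matrix has this regular structure, it can be generated mechanically. The following sketch (our own, not the authors' code) builds the 7 x 16 matrix for the Router1 network under the origin-major column ordering described above:

```python
# A sketch (not the authors' code) of the 7 x 16 routing matrix A for the
# single-router network: 4 incoming-link rows, then outgoing-link rows for
# destinations 1-3 (the 4th outgoing count is dropped as redundant).
import numpy as np

n = 4                                    # edge nodes on Router1
I = n * n                                # OD pairs, columns ordered (o, d)
A = np.zeros((2 * n - 1, I), dtype=int)
for o in range(n):                       # incoming count at origin o
    for d in range(n):
        A[o, o * n + d] = 1
for d in range(n - 1):                   # outgoing count at destination d
    for o in range(n):
        A[n + d, o * n + d] = 1

# Row 1: bytes arriving on interface 1 = sum of OD pairs with origin 1.
print(A[0])    # -> [1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0]
```

The second column of the resulting matrix picks out the incoming row for origin 1 and the outgoing row for destination 2, matching the description in the text.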
The unobserved OD byte counts, x[sub t] at time t are modeled as a vector of independent normal random variables
(1) x[sub t] ~ normal(lambda, SIGMA),
implying that the observed link byte counts, y[sub t], have distribution
(2) y[sub t] = Ax[sub t] ~ normal(A lambda, A SIGMA A').
The parameters are lambda = (lambda[sub 1],..., lambda[sub I])', lambda[sub i] > 0, and
(3) SIGMA = phi diag(sigma[sup 2](lambda[sub 1]),..., sigma[sup 2](lambda[sub I])),
where phi > 0 is a scale parameter and the function sigma[sup 2] (.) describes a relationship that is assumed to be known between the mean and variance. Although SIGMA depends on lambda and phi, we suppress this from the notation. Equations (1)-(3) define the general model, but this article concentrates on a specific power law form for sigma[sup 2]:
(4) sigma[sup 2](lambda) = lambda[sup c]
where c is a constant.
The normal approximation to the Poisson model mentioned by Vardi (1996) is the special case phi = 1 and c = 1. It is necessary, however, to account for extra Poisson variability in our byte count data. Section 3.2 presents some exploratory data analysis suggesting that the power law (4) controls the relation between mean and variance. We regard phi primarily as a nuisance parameter to account for extra Poisson variation; but a change in phi can also be used to accommodate a change of the units of traffic measurements from "bytes" to "bytes/sec" that may be more intuitive to system administrators. In other words, our model is scale-invariant and the Poisson model is not. The power c is not estimated formally in our approach, but section 3 shows how to select a reasonable value and demonstrates the differences between two choices: c = 1 and c = 2.
Normal distributions describe continuous variables. Given the high speed of today's networks and the 5-minute measurement intervals, the discreteness of byte counts can be ignored. The normal model is nevertheless an approximation, if only because x[sub t] must be positive. If x[sub t] is Poisson distributed with a relatively large mean, the approximation is good. Working in the normal family makes the distribution of the observations y[sub t] easy to handle and, as we shall see, leads to computational efficiency. Section 3.2 explores these model assumptions for our data.
2.2 Identifiability
Model (2)-(4) with a fixed power c is identifiable. This is stated in the following theorem and corollary. Proofs are given in the Appendix.
Theorem 1. Let B be the [J(J + 1)/2] x I matrix whose rows are the rows of A and the component-wise products of each different pair of rows from A. Model (2)-(4) is identifiable if and only if B has full column rank.
Corollary 1. For byte counts from router interface data, B has full column rank and thus model (2)-(4) is identifiable.
An intuitive explanation of the identifiability for the router interface data follows from hints that surface in the proof of the corollary. For the ith origin-destination pair, let y[sub o] represent the number of incoming bytes at the origin interface and let y[sub d] represent the number of outgoing bytes at the destination interface. The only bytes that contribute to both of these counts are those from the ith OD pair, and thus cov(y[sub o], y[sub d]) = phi lambda[sup c, sub i]. Therefore, lambda[sub i] is determined up to the scale phi. Additional information from E(y[sub t]) identifies the scale, and identifiability follows. Similar reasoning justifies Vardi's moment estimator of lambda.
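Theorem 1 reduces the identifiability check to a rank computation, which is easy to carry out numerically. The sketch below (the one-router topology is an assumed example, as before) verifies Corollary 1 for a 4-node router:

```python
import numpy as np
from itertools import combinations

def b_matrix(A):
    """B from Theorem 1: the rows of A stacked with the componentwise
    products of each distinct pair of rows of A."""
    pair_products = [A[i] * A[j] for i, j in combinations(range(A.shape[0]), 2)]
    return np.vstack([A] + pair_products)

# Hypothetical one-router topology, 4 edge nodes: 8 interfaces, 16 OD pairs.
n = 4
A = np.zeros((2 * n, n * n))
for o in range(n):
    for d in range(n):
        A[o, o * n + d] = 1      # origin interface row
        A[n + d, o * n + d] = 1  # destination interface row

B = b_matrix(A)
J, I = A.shape
```

The product of origin row o with destination row d is nonzero in exactly one column, the OD pair (o, d), so B contains every unit vector and full column rank follows immediately for this topology.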
2.3 Estimation of lambda Based on IID Observations
The basic model given by (2)-(4) is identifiable, but reasonable estimates of the parameters require information to be accumulated over a series of measurements. We now describe a maximum likelihood analysis based on T iid link measurement vectors Y = (y[sub 1],..., y[sub T]) to infer the set of OD byte count vectors X = (x[sub 1],..., x[sub T]). This simple iid model forms the basis for the time-varying estimates in section 3, where it is applied locally to byte count vectors over short windows of successive time points.
Under assumption (2) and for theta = (lambda, phi), the log-likelihood is
l(theta|Y) = -T/2 log|A SIGMA A'| - 1/2 SIGMA[sub t](y[sub t] - A lambda)'(A SIGMA A')[sup -1](y[sub t] - A lambda).
Let W = A'(A SIGMA A')[sup -1]A with ijth element w[sub ij], and let sigma[sup 2]'[sub i] and sigma[sup 2]''[sub i] respectively denote the first and second derivatives of sigma[sup 2](.) evaluated at lambda[sub i]; under assumption (4), sigma[sup 2]'[sub i] = c lambda[sup c-1, sub i]. Writing q[sub t] = A'(A SIGMA A')[sup -1](y[sub t] - A lambda) for the weighted residual, the score equations are
differential l/differential lambda[sub i] = SIGMA[sub t] q[sub t,i] + (phi sigma[sup 2]'[sub i]/2)(SIGMA[sub t] q[sup 2, sub t,i] - T w[sub ii]), i = 1,..., I,
differential l/differential phi = (1/(2 phi))(SIGMA[sub t](y[sub t] - A lambda)'(A SIGMA A')[sup -1](y[sub t] - A lambda) - TJ).
The I x I Fisher information matrix -E(differential[sup 2]l/differential lambda differential lambda') for lambda has entries
(5) -E(differential[sup 2]l/differential lambda[sub i] differential lambda[sub j]) = T(w[sub ij] + 1/2 phi[sup 2] sigma[sup 2]'[sub i] sigma[sup 2]'[sub j] w[sup 2, sub ij]),
which provides insight into the source of information for estimating lambda. The first term is the information about lambda that would come from a model in which SIGMA had no relation to lambda. The second term brings in the additional information about lambda that is provided by the covariance of y[sub t]. Because rank(W) = rank(A) and A is generally far from full column rank, the Fisher information matrix for lambda would be singular without the second term of (5). Thus it is the covariance of y[sub t] that provides the crucial information for estimating lambda. This is reminiscent of the role the covariance played in proving identifiability. In fact, the matrix with entries w[sup 2, sub ij] can be proved positive definite as long as lambda > 0, which implies that the Fisher information matrix is nonsingular.
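The decomposition in (5) can be checked numerically. In this sketch (the topology, the dropped redundant row, and all parameter values are illustrative assumptions), the first term T W alone is singular while the full Fisher information matrix is positive definite:

```python
import numpy as np

# Hypothetical 3-node router, reduced to full row rank by dropping one
# redundant interface row (total incoming = total outgoing traffic).
n = 3
A = np.zeros((2 * n, n * n))
for o in range(n):
    for d in range(n):
        A[o, o * n + d] = 1
        A[n + d, o * n + d] = 1
A = A[:-1]                         # drop a redundant row

rng = np.random.default_rng(1)
lam = rng.uniform(1.0, 10.0, n * n)
phi, c, T = 1.5, 2.0, 11

Sigma = np.diag(phi * lam ** c)
W = A.T @ np.linalg.inv(A @ Sigma @ A.T) @ A   # W = A'(A SIGMA A')^{-1} A
sdot = c * lam ** (c - 1)                      # derivative of sigma^2(lambda)
fisher = T * (W + 0.5 * phi ** 2 * np.outer(sdot, sdot) * W ** 2)  # eq. (5)
```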
Because SIGMA is functionally related to lambda, there are no analytic solutions to the likelihood equations. We turn to derivative-based algorithms to find the maximum likelihood estimate (MLE) of the parameters phi and lambda, subject to the constraints phi > 0 and lambda > 0. We find it useful to start the numerical search using the EM algorithm (Dempster, Laird, and Rubin 1977), with the complete data defined as the T unobserved byte count vectors X = (x[sub 1],...,x[sub T]) of the OD pairs. Note that the complete data log-likelihood is the familiar normal form
l(theta|X) = -T/2 log|SIGMA| - 1/2 SIGMA[sub t](x[sub t] - lambda)' SIGMA[sup -1](x[sub t] - lambda).
Let theta[sup (k)] be the current estimate of the parameter theta. Quantities that depend on theta[sup (k)] are denoted with a superscript (k). The usual EM conditional expectation function Q is
Q(theta; theta[sup (k)]) = E(l(theta|X)|Y, theta[sup (k)]) = -T/2 log|SIGMA| - 1/2 SIGMA[sub t][(m[sup (k), sub t] - lambda)' SIGMA[sup -1](m[sup (k), sub t] - lambda) + tr(SIGMA[sup -1] R[sup (k), sub t])],
where
(6) m[sup (k), sub t] = E(x[sub t]|y[sub t], theta[sup (k)]) = lambda[sup (k)] + SIGMA[sup (k)] A'(A SIGMA[sup (k)] A')[sup -1](y[sub t] - A lambda[sup (k)]),
R[sup (k), sub t] = var(x[sub t]|y[sub t], theta[sup (k)]) = SIGMA[sup (k)] - SIGMA[sup (k)] A'(A SIGMA[sup (k)] A')[sup -1] A SIGMA[sup (k)]
are the conditional mean and variance of x[sub t] given both y[sub t] and the current estimate theta[sup (k)]. To complete the M step, we need to maximize the Q function with respect to theta. Let
a[sup (k), sub i] = (1/T) SIGMA[sub t][(m[sup (k), sub t,i])[sup 2] + R[sup (k), sub t](i, i)] and b[sup (k), sub i] = (1/T) SIGMA[sub t] m[sup (k), sub t,i].
It can be shown that equations differential Q/differential theta = 0 are equivalent to
(7) 0 = c phi lambda[sup c, sub i] + (2 - c) lambda[sup 2, sub i] - 2(1 - c) lambda[sub i] b[sup (k), sub i] - c a[sup (k), sub i], i = 1,..., I,
0 = SIGMA[sup I, sub i=1] lambda[sup 1-c, sub i](lambda[sub i] - b[sup (k), sub i]).
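As an illustration of one full EM pass, here is a runnable sketch on a tiny invented network; the topology, true parameters, and starting values are all assumptions for demonstration. For c = 1 the first equation of (7) is a quadratic in lambda[sub i], so given phi it has the explicit positive root used below, and phi is then found by bisection on the second equation:

```python
import numpy as np

rng = np.random.default_rng(2)
# Tiny hypothetical network: I = 4 OD pairs, J = 3 links, full-row-rank A.
A = np.array([[1., 1., 0., 0.],
              [0., 0., 1., 1.],
              [1., 0., 1., 0.]])
lam_true = np.array([3., 6., 2., 5.])
phi_true, c = 0.5, 1.0                      # c = 1: near-Poisson case
T = 1000
X = rng.multivariate_normal(lam_true, np.diag(phi_true * lam_true ** c), size=T)
Y = X @ A.T                                 # observed link vectors y_t

def em_step(lam, phi, Y, A):
    """One EM iteration for c = 1: E-step per (6), M-step solving (7)."""
    Sig = np.diag(phi * lam)
    G = Sig @ A.T @ np.linalg.inv(A @ Sig @ A.T)
    M = lam + (Y - lam @ A.T) @ G.T          # m_t^{(k)}, eq. (6)
    R = Sig - G @ A @ Sig                    # R_t^{(k)}, same for every t
    a = (M ** 2).mean(axis=0) + np.diag(R)   # a_i^{(k)}
    b = M.mean(axis=0)                       # b_i^{(k)}
    # For c = 1, (7) reads 0 = phi lam_i + lam_i^2 - a_i and sum(lam) = sum(b).
    lam_of = lambda p: (-p + np.sqrt(p ** 2 + 4 * a)) / 2
    lo, hi = 1e-8, 1e3                       # bisection for phi
    for _ in range(100):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if lam_of(mid).sum() > b.sum() else (lo, mid)
    phi_new = (lo + hi) / 2
    return lam_of(phi_new), phi_new

lam, phi = np.full(4, Y.mean() / 2), 1.0     # crude positive starting point
for _ in range(200):
    lam, phi = em_step(lam, phi, Y, A)
```

Note that A m[sup (k), sub t] = y[sub t] identically, so the E-step means always reproduce the observed link counts; the covariance information is what separates the OD pairs within each link sum.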
The quantities a[sup (k), sub i] are nonnegative by definition and, with this in mind, it is straightforward to show that nonnegative solutions lambda and phi to equation (7) always exist, even though they must generally be found numerically. Let f(theta) = (f[sub 1](theta),..., f[sub I+1](theta))' be the right-hand sides of the previous equations. We use the one-step Newton-Raphson algorithm to update theta[sup (k)] (Lange 1995):
theta[sup (k+1)] = theta[sup (k)] - [F(theta[sup (k)])][sup -1] f(theta[sup (k)]),
where F is the Jacobian of f(theta) with respect to theta:
differential f[sub i]/differential lambda[sub j] = delta[sub ij](phi c[sup 2] lambda[sup c-1, sub i] + 2(2 - c) lambda[sub i] - 2(1 - c) b[sup (k), sub i]),
differential f[sub I+1]/differential lambda[sub j] = (2 - c)lambda [sup 1-c, sub j] - (1 - c) lambda[sup -c, sub j]b[sup (k), sub j]
differential f[sub i]/differential phi = c lambda[sup c, sub i], differential f[sub I+1]/differential phi = 0
for i, j = 1,..., I. In general, the Newton-Raphson steps do not guarantee theta[sup (k)] > 0, but in the special cases c = 1 and c = 2, given phi, we can explicitly find a positive solution to (7) and use fractional Newton-Raphson steps on phi when necessary to prevent negative solutions.
Convergence of the previous modified EM algorithm has been proved (Lange 1995). However, it is usually quite slow in practice due to the sublinear convergence of the EM algorithm (Dempster et al. 1977). Second-order methods based on quadratic approximations of the likelihood surface have faster convergence rates. There are many such algorithms and we use the one in the S function ms (Chambers and Hastie 1993), which is based on the published algorithm of Gay (1983). Our implementation uses analytical derivatives of the likelihood surface up to second order. Based on this information, the algorithm derives a quadratic approximation to the likelihood surface and uses a model trust region approach in which the quadratic approximation is only trusted in a small region around the current search point. With this algorithm, as with the EM algorithm, the likelihood function increases at each subsequent iteration. Because this algorithm is designed for unconstrained optimization, we reparameterize the likelihood function using eta = (log(lambda), log(phi)) and supply to ms the first derivatives and second derivatives in terms of eta. We summarize the numerical algorithm as follows: (a) Initialize theta = theta[sub 0]; (b) Update theta using Newton-Raphson EM steps until the change in l(theta) is small; (c) Update theta using a second-order method (like ms in S) until convergence is declared.
The choice of starting point is fairly arbitrary for the EM iterations. For lambda, we use a constant vector in the center of the region; that is, lambda[sub 0] = a[sub 0]1, where a[sub 0] solves 1'A lambda[sub 0] = 1' SIGMA[sub t] y[sub t]/T. In the case c = 1, phi = var(y[sub t,j])/E(y[sub t,j]) for any j = 1,..., J; a moment estimator based on this relation gives the starting value of phi, and similar ideas give starting values of phi for general c. Our experience is that this easily computed starting point gives stable performance. A more complex choice is to use a moment estimator like that proposed by Vardi (1996), but adapted to the mean-variance relation (4). Computations take 6 seconds per MLE window on the Router1 network with 16 OD pairs using S-PLUS 3.4 on a shared SGI Origin 2000 with 200-MHz processors. The computations to produce Figure 5, for example, take 30 minutes. This could be reduced considerably by computing in a compiled language such as C.
2.4 Estimation of X Based on IID Observations
If theta is known, then E(X|Y, theta, X > 0) has minimum mean square prediction error for estimating X. Conditioning on X > 0 safeguards against negative estimates. When theta is unknown, a natural alternative is
X-hat = E(X|Y, theta = theta-hat, X > 0),
where theta-hat is the MLE of theta based on Y. Because the samples are independent, the tth column of X-hat is equal to
(8) x-hat[sub t] = E(x[sub t]|y[sub t], theta-hat, x[sub t] > 0).
But [x[sub t]|y[sub t], theta-hat] is multivariate normal, and thus computing x-hat[sub t] generally requires multidimensional integration over the positive quadrant. To avoid this, we reason as follows: if the normal approximation for x[sub t] is appropriate, the positive quadrant will contain nearly all the mass of the distribution of [x[sub t]|y[sub t], theta-hat], and thus conditioning on x[sub t] > 0 has only a small effect. The more crucial matter is to satisfy the constraint Ax-hat[sub t] = y[sub t], for which an iterative proportional fitting procedure is well suited. It is natural to ask whether this summation constraint, together with the positivity constraint, would be adequate to determine x[sub t] without our proposed statistical model. The answer is in general "no," and we return to this point in the validation section.
Iterative proportional fitting is an algorithm widely used in the analysis of contingency tables. It adjusts the table to match the known marginal totals. The algorithm and its convergence have been the subject of extensive study (Csiszar 1975; Deming and Stephan 1940; Ireland and Kullback 1968; Ku and Kullback 1968). For a one-router network, the linear constraints Ax[sub t] = y[sub t] with positive x[sub t] can be re-expressed in a contingency table form if we label the table in both directions by the nodes in the network, with the rows as the origin nodes and the columns as destinations. Each entry in the table represents the byte count for an OD pair. In this case, the constraint translates exactly into column and row margin summation constraints. For general networks, Ax[sub t] = y[sub t] corresponds to further summation constraints beyond the marginal constraints, but the iterative proportional fitting algorithm can easily deal with these additional constraints in the same way as the marginal constraints. As long as the space {x[sub t] : y[sub t] = Ax[sub t], x[sub t] >/= 0} is not empty, positivity of the starting point is a sufficient condition for convergence.
To give a positive starting value, we use a componentwise version of (8),
(9) x[sup (0), sub t,i] = E(x[sub t,i]|y[sub t], theta-hat, x[sub t,i] > 0), i = 1,..., I,
and then we adjust the resulting vector x[sup (0), sub t] = (x[sup (0), sub t,1],..., x[sup (0), sub t,I]) using an iterative proportional fitting procedure to meet the constraint Ax-hat[sub t] = y[sub t]. The quantities x[sup (0), sub t,i] can be computed from the following easily derived formula for a Gaussian variate Z ~ normal(mu, sigma[sup 2]): E(Z|Z > 0) = mu + (sigma/Square root of 2 pi) exp(-mu[sup 2]/(2 sigma[sup 2]))/PHI(mu/sigma), where PHI(.) is the standard normal cumulative distribution function. The conditional mean and variance of [x[sub t,i]|y[sub t], theta-hat] can be found using (6).
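The closed form for E(Z|Z > 0) is easy to implement and to check against numerical integration; a small sketch (the function name is ours):

```python
import math

def positive_part_mean(mu, sigma):
    """E(Z | Z > 0) for Z ~ normal(mu, sigma^2), per the closed form above."""
    z = mu / sigma
    Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))          # standard normal cdf
    dens = math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)  # standard normal pdf
    return mu + sigma * dens / Phi
```

For mu well above zero the correction term is negligible, which is why conditioning on x[sub t] > 0 has little effect when the normal approximation is adequate.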
Our iterative proportional fitting procedure starts with x[sup (0), sub t] from (9) and then sweeps cyclically through the constraint equations as follows. For each row a[sub j] of A (j = 1,..., J), obtain x[sup (j), sub t] by multiplying the components of x[sup (j-1), sub t] corresponding to nonzero elements of a[sub j] by the factor y[sub t,j]/(a[sub j] x[sup (j-1), sub t]). Then set x[sup (0), sub t] = x[sup (J), sub t] and repeat. Convergence is declared when all constraints are met to a given numerical accuracy.
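A sketch of that sweep (the helper name and the toy one-router example are our own). For a one-router network this is the classical iterative proportional fitting of a contingency table whose rows are origins and columns are destinations:

```python
import numpy as np

def proportional_fit(x0, A, y, tol=1e-10, max_sweeps=1000):
    """Sweep cyclically through the rows of A, rescaling the components of x
    that appear in row a_j so that a_j' x = y_j, until all constraints hold."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(max_sweeps):
        for j in range(A.shape[0]):
            x[A[j] > 0] *= y[j] / (A[j] @ x)
        if np.max(np.abs(A @ x - y)) < tol * max(1.0, np.max(np.abs(y))):
            break
    return x

# Hypothetical one-router example with consistent targets y = A x_true.
n = 3
A = np.zeros((2 * n, n * n))
for o in range(n):
    for d in range(n):
        A[o, o * n + d] = 1
        A[n + d, o * n + d] = 1

rng = np.random.default_rng(4)
x_true = rng.uniform(1.0, 10.0, n * n)
y = A @ x_true
x_fit = proportional_fit(np.ones(n * n), A, y)
```

A positive starting point keeps every iterate positive, and the result satisfies the constraints exactly (to numerical accuracy) while generally differing from x_true, which is precisely why the statistical model is still needed.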
Recall from Figure 2 that the links exhibit outbursts of traffic interleaved with low-activity intervals. Changes in the local average byte counts are obvious, and a time-varying approach is needed to model these data.
3.1 A Local IID Model
To allow for parameters that depend on t, the basic iid model is extended to a local iid model using a moving window of fixed size w = 2h + 1, where h is the half-width. For estimation at time t, the observations in the window centered at t are treated as iid:
(10) y[sub t-h],..., y[sub t+h] ~ iid normal(A lambda[sub t], A SIGMA[sub t] A'),
where SIGMA[sub t] = phi[sub t] diag(lambda[sup c, sub t]). This is the moving iid version of model (2)-(4), and the methods of the previous section are used to obtain estimates lambda-hat[sub t] and x-hat[sub t]. Because consecutive windows overlap, the estimates lambda-hat[sub t] and x-hat[sub t] are implicitly smoothed.
The iid assumption for consecutive x[sub t] vectors within a window is approximate with respect to both independence and identical distributions. The assumption makes our method simple. By choosing a relatively small window size, the method can effectively adapt to the time-varying nature of the data. A moving window approach formally falls under local likelihood analysis (Hastie and Tibshirani 1990; Loader 1999), with a rectangular weight function. Within that framework, equation (10) can be justified as a local likelihood approximation to a time-varying model with observations that are independent over time.
3.2 Exploratory Analysis and Model Checking
Before estimating parameters in the Router1 network, we do some exploratory analysis to check appropriateness of the various assumptions made about x[sub t]. Because x[sub t] is not observed, however, these checks must be based only on y[sub t]. In what follows, m[sub t] and s[sub t] denote the expected value and component-wise standard deviation of y[sub t].
3.2.1 IID Normality. Using the switch link as a representative interface, Figure 3 shows a normal probability plot (left) and a time series plot (right) for standardized residuals of byte counts originating at switch. The vector of residuals for all links at time t is defined as e[sub t] = (y[sub t] - m[sub t])/s[sub t], where m[sub t] and s[sub t] are the sample mean and standard deviation over a window of size 11 centered at t. Each element of e[sub t] should be approximately iid normal over time. The probability plot is somewhat concave and this is typical of the other links as well, implying that the actual count distributions have heavier upper tails than the normal. The agreement is sufficient, however, for a normal-based modeling approach to be meaningful.
Independence over time is regarded as an effective assumption and the time-series plot of residuals in the right-hand panel is meant only to check for gross violations. For a small window size, the link counts have more or less constant mean and variance unless the window happens to include an abrupt change for one or more links; in this case we have to live with the consequences of our model being wrong. This point is addressed further in section 3.3. Large windows including several days, for example, might contain diurnal patterns that would render the iid assumption unreasonable. With w = 11 corresponding to a 55-minute window, the local dependence is negligible.
3.2.2 Window Size. In Section 3.3, the window w = 11 is chosen. Windows of 5 and 21 yielded similar final results. It is possible to follow the guidelines in Loader (1999) to choose the window size via cross validation (CV) to minimize the mean square estimation error for x[sub t], but we have not done this.
3.2.3 Variance-Mean Relationship. Identifiability requires a relation between the mean and variance of the OD byte counts. If the relation is unknown, however, it is difficult to uncover from the link data alone. In section 2 we proposed a power relation as a simple extension of the Poisson count model that allows overdispersion and a superlinear increase of the variance with the mean. We now attempt to check this assumption.
Figure 4 plots log s[sup 2, sub t,j] against log m[sub t,j] (j = 1,..., 8) using the local averages and standard deviations described previously for checking the normal assumption. The two panels represent windows centered at two different times: t = 11:30 AM and t = 3:30 PM. Each point in the plot represents a given network link. According to the argument that follows, a power relation SIGMA[sub t] = phi[sub t] diag(lambda[sup c, sub t]) will tend to produce a linear plot with slope equal to c. Rough linearity is evident in the figure, and the slopes of the solid fitted lines indicate that a quadratic power law (c = 2) is more reasonable than a linear law (c = 1). The two panels shown are typical of most cases.
Suppose that only the first K of the OD pairs contribute significantly to the byte counts on the jth link at time t. Then for c is greater than or equal to 1 and K is greater than or equal to 1,
K[sup 1-c](lambda[sub 1] +...+ lambda[sub K])[sup c] less than or equal to lambda[sup c, sub 1] +...+ lambda[sup c, sub K] less than or equal to (lambda[sub 1] +...+ lambda[sub K])[sup c].
Thus the variance s[sup 2, sub t,j] and mean m[sub t,j] of the byte counts for link j satisfy
log s[sup 2, sub t,j] = log phi + c log m[sub t,j] + a[sub t,j],
where a[sub t,j] is bounded between (1 - c) log K and zero. If byte counts on most of the link interfaces are dominated by a few OD pairs, then (1 - c) log K is likely to be small in comparison to the variation of c log m[sub t,j], and thus log s[sup 2, sub t,j] will be approximately linear in log m[sub t,j] with slope c. When c = 1, a[sub t,j] = 0 and the linearity holds exactly.
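The dominance argument can be verified directly on synthetic intensities; the topology and the dominant-pair values below are invented for illustration. Each interface is dominated by a single OD pair, and the fitted log-log slope then tracks c:

```python
import numpy as np

n = 4
A = np.zeros((2 * n, n * n))
for o in range(n):
    for d in range(n):
        A[o, o * n + d] = 1
        A[n + d, o * n + d] = 1

# Each interface dominated by one OD pair (o, n-1-o), as the argument assumes.
lam = np.ones(n * n)
for o, big in enumerate([100.0, 400.0, 1600.0, 6400.0]):
    lam[o * n + (n - 1 - o)] = big

phi = 2.0
m = A @ lam                           # population link means
slopes = {}
for c in (1.0, 2.0):
    s2 = phi * (A @ lam ** c)         # population link variances
    slopes[c] = np.polyfit(np.log(m), np.log(s2), 1)[0]
```

With c = 1 the relation log s[sup 2] = log phi + log m is exact, so the fitted slope is exactly 1; with c = 2 the slope deviates from 2 only by the small a[sub t,j] term.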
It may be possible to estimate c directly from the data, but identifiability of this more general model is not obvious. Therefore, we select a value for c from a limited number of candidates. Because these models all have the same complexity, our selection procedure compares the maximum log-likelihood for each window and selects the value of c that on average gives the largest maximum log-likelihood.
3.3 Local Fits to Router1 Data
3.3.1 Estimates of lambda. We fit models with c = 1 and c = 2 using a moving window of width w = 11. Comparing the fits, 98% of windows give larger likelihood for c = 2 even though the difference in fit is typically small.
Figure 5 plots estimates of lambda (cyan) for the Router1 network. For comparison, the top and right marginal panels also show 11-point moving averages of the observed byte counts y[sub t] in black. These moving averages do not rely on a relation between mean and variance, but they are nevertheless in general agreement with the model-based estimates. If the match were poor, we would question the power-law specification. The model-based margins are more variable than the moving averages. This is most obvious at the high peaks in Figure 5, but magnifying the vertical scale by a factor of 20 (not shown) would show that the difference in smoothness holds at smaller scales as well. Discussion of the estimates x-hat[sub t] is deferred to section 5, following the modeling refinements of section 4.
3.3.2 Difficulties with Strictly Local Fitting. An intrinsic problem with fitting a large number of parameters using a small moving window is that the likelihood surface can have multiple local maxima and can be poorly conditioned in directions where the data contain little information about the parameters. This can lead to estimation inaccuracies and numerical difficulties. Some of the smaller-scale variations of lambda-hat[sub t] in Figure 5 are likely due to poor conditioning.
An example of a numerical problem arises when the iterative procedure for estimating lambda[sub t] converges to a boundary point at which some components are 0. Because estimation is done in terms of log lambda[sub t], the likelihood surface is flat in the directions corresponding to the 0s. Additional singularity arises when the number of nonzero components becomes smaller than the number of independent link interfaces.
The following section presents a refinement to the local model with the intent of overcoming numerical problems and, more importantly, improving estimation accuracy in parameter directions that are poorly determined from small data windows. Our approach encourages smoother parameter estimates by borrowing information from neighboring data windows through the use of a prior derived from previous estimates. The refinement is especially useful for large networks with many more parameters than observed link interfaces.
Let eta[sub t] = (log(lambda[sub t]), log(phi[sub t])) be the log of the parameter time series. We model eta[sub t] as a multidimensional random walk:
(11) eta[sub t] = eta[sub t-1] + v[sub t], v[sub t] ~ normal(0, V),
where V is a fixed variance matrix chosen beforehand. Equations (10) and (11) compose a state-space model for the windows Y[sub t] = (y[sub t-h],..., y[sub t],..., y[sub t+h]). The choice of V is important because it determines how much information from previous observations carries over to time t. Let Y[sub 1:t] = (y[sub 1],..., y[sub t+h]) denote all the observations up to time t + h; then
p(eta[sub t]|Y[sub 1:t]) proportional to p(eta[sub t]|Y[sub 1:t-1]) p(Y[sub t]|eta[sub t]).
Thus maximizing the posterior p(eta[sub t]|Y[sub 1:t]) is equivalent to maximizing the log-likelihood plus an additive penalty corresponding to the adaptive log-prior on eta[sub t] conditioned on past data. The penalty term regularizes nearly flat directions of the original log-likelihood surface that can otherwise produce poor estimates of lambda.
For the prior on eta[sub t] we have
p(eta[sub t]|Y[sub 1:t-1]) = Integral of p(eta[sub t-1]|Y[sub 1:t-1]) p(eta[sub t]|eta[sub t-1]) d eta[sub t-1],
but to relieve the computational burden, we approximate the posterior p(eta[sub t-1]|Y[sub 1:t-1]) at t - 1 by normal(eta-hat[sub t-1], SIGMA[sub t-1]), where eta-hat[sub t-1] is the posterior mode and SIGMA[sub t-1] is the inverse of the curvature of the log posterior density at the mode (Gelman, Carlin, Stern, and Rubin 1995). Hence p(eta[sub t]|Y[sub 1:t-1]) can be approximated by normal(eta-hat[sub t-1], SIGMA[sub t-1] + V). With this prior, optimization of p(eta[sub t]|Y[sub 1:t]) can be handled in a manner similar to that outlined in section 2.3. For the EM algorithm, the Q function is modified by including an extra term from the prior, and hence only the M step has to be changed. If the EM algorithm is used only to provide a good starting point for a second-order method, it may not even be necessary to modify the M step. To use a second-order optimization routine (like ms in S), all the derivative calculations are modified by including the derivatives of the log-prior, log p(eta[sub t]|Y[sub 1:t-1]), and hence can be carried out just as easily as before. As a result, using a normal approximation to the adaptive prior adds almost no computational burden to parameter estimation. The final algorithm is as follows.
1. Initialize t = h + 1, eta-hat[sub t-1] = eta[sub 0], and SIGMA[sub t-1] = SIGMA[sub 0].
2. Let SIGMA[sub t|t-1] = SIGMA[sub t-1] + V, so that the prior of eta[sub t] given observations Y[sub 1:t-1] is approximately normal(eta-hat[sub t-1], SIGMA[sub t|t-1]).
3. Let g(eta[sub t]) = log pi(eta[sub t]) + log p(Y[sub t]|eta[sub t]) be the log of the posterior density of eta[sub t]. Find the mode eta-hat[sub t] = argmax g(eta[sub t]) using an optimization method. Note that the derivatives of g are g'(eta[sub t]) = -SIGMA[sup -1, sub t|t-1](eta[sub t] - eta-hat[sub t-1]) + differential log p/differential eta[sub t] and g''(eta[sub t]) = -SIGMA[sup -1, sub t|t-1] + differential[sup 2] log p/differential eta[sup 2, sub t].
4. Let SIGMA[sub t] = [-g''(eta-hat[sub t])][sup -1], set t = t + 1, and return to Step 2.
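The recursion in steps 1-4 can be sketched on a one-dimensional toy problem. Everything here is an invented stand-in (a scalar state and a unit-variance Gaussian window likelihood with mean exp(eta)), not the actual byte-count likelihood, but the Newton update uses exactly the derivatives g' and g'' given in step 3:

```python
import numpy as np

def filter_step(eta_prev, Sig_prev, V, ys, iters=50):
    """One step of the adaptive-prior recursion on a 1-D toy model:
    prior eta ~ normal(eta_prev, Sig_prev + V) from the previous window;
    likelihood y ~ iid normal(exp(eta), 1) stands in for p(Y_t | eta_t)."""
    s2 = Sig_prev + V
    eta = np.log(max(np.mean(ys), 1e-6))   # robust Newton start near the mode
    for _ in range(iters):
        mu = np.exp(eta)
        g1 = -(eta - eta_prev) / s2 + np.sum((ys - mu) * mu)   # g'(eta)
        g2 = -1.0 / s2 + np.sum(mu * (ys - 2.0 * mu))          # g''(eta)
        eta -= g1 / g2                                         # Newton update
    return eta, -1.0 / g2    # posterior mode and curvature-based variance

rng = np.random.default_rng(5)
ys = rng.normal(5.0, 1.0, size=11)          # one window, w = 11
eta_hat, Sig_t = filter_step(eta_prev=np.log(4.0), Sig_prev=1.0, V=0.1, ys=ys)
```

The returned pair (eta_hat, Sig_t) is exactly what step 4 carries forward as the normal approximation to the posterior at the next time point.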
The proposed model and algorithm are similar to Kalman Filtering (Anderson and Moore 1979) and Bayesian dynamic models (West and Harrison 1997). The effect of the adaptive prior pi(eta[sub t]) on estimation and hence the smoothness is controlled by the size of V. If a relatively large V is chosen, pi(eta[sub t]) only plays a secondary role. In comparison, the choices of eta[sub 0] and SIGMA[sub 0] are less important in the sense that their effects die out with time. In our implementation, both V and eta[sub 0] are set empirically from preliminary parameter estimates obtained in section 2, and a large SIGMA[sub 0] is chosen to reflect poor prior knowledge at the start of estimation.
Figure 6 shows estimates of lambda[sub t] using the adaptive prior (magenta) and compares them to previous estimates (cyan) with no prior. The vertical scale is magnified 20 times over that of Figure 5. As desired, the prior-based estimates are clearly smoother than those from strict local fitting. Moreover, in the upper and right marginal panels, the marginal sums of the new estimates (magenta) are less variable than those of the old (cyan).
The ultimate goal is to estimate the actual time-varying traffic x[sub t] and to assess the accuracy of these estimates as well as the fit of the model. Standard residual analyses are not available, however, because x[sub t] is unobservable and the fitting procedure provides an exact fit to the link observations: y[sub t] - Ax-hat[sub t] = 0. Thus, to validate the modeling approach, we instrumented the Router1 network to directly measure complete OD byte counts x[sub t] and not merely the link measurements y[sub t]. This section compares the estimates x-hat[sub t] with actual OD traffic.
Measuring OD traffic on a LAN generally requires special hardware and software. Router1 is a Cisco 7500 router capable of generating data records of IP (Internet protocol) flows using an export format called netflow. These records were sent to a nearby workstation running cflowd software (CAIDA 1999) that builds a database of summary information in real time. Finally, we ran aggregation queries on the flow database to calculate OD traffic matrices for the Router1 network. Queries were run automatically at approximately 5-minute intervals to match the routine link measurements studied in the previous sections.
In principle, marginal sums from the netflow data should match the link measurements over identical time intervals. Actual measurements inevitably have discrepancies due to timing variations and slight differences in how bytes are accumulated by the two measurement methods. Let x[sub t] represent the OD traffic measured from netflow records for the 5-minute interval ending at time t. Link measurements corresponding to x[sub t] are calculated as y[sub t] = Ax[sub t], and these are then used to form estimates x-hat[sub t] of the OD traffic. Fitted values in this section are based on the adaptive prior model of section 4.
Full-scale plots (not shown) show excellent agreement between the actual OD traffic x[sub t] and the estimates x-hat[sub t]. Large OD flows, corresponding to periods of high usage, are estimated with very small relative error. A more critical view of the estimates is given in Figure 7, where the vertical axis is magnified by a factor of 20 so that the large, well-estimated peaks are off-scale. The figure focuses on estimation errors for the smaller-scale features. In particular, we sometimes predict a substantial amount of traffic when the actual amount is zero. This is especially apparent in the lower right panel, labeled fddi -> corp, where the traffic is greatly overestimated relative to the actual low traffic. Because fitted margins must match the data, overestimates in one panel are compensated by underestimates in others, with positive errors balanced by negative errors in each row and column. Perhaps additional smoothing from the prior would be useful. We have considered setting the prior variances and the local window width by cross-validation but have not pursued this possibility in depth.
Although estimation errors are sometimes large, the model-based estimates perform well when compared to the range of all possible OD estimates that are both nonnegative and consistent with the observed link data. As an example, at 3:30 AM we compute actual estimation errors for each of the 16 OD pairs and divide by the range of all possible estimates. Nine of these ratios are less than .14% and all are less than 8%. By this measure, the statistical model is contributing a tremendous amount to the estimation accuracy.
Figure 8 shows a two-router portion of the larger Lucent network depicted in Figure 1. The routers, labeled Router4 and Gateway, each support four edge nodes and one internal link. This gives a total of 20 one-way link interfaces. Router4 serves one organization within Lucent, and Gateway is the router with a connection to the public Internet. The edge nodes represent subnetworks. Applying our methods to this somewhat more complex topology demonstrates that the approach is computationally feasible, though not fully scalable to larger problems. Experience with this network also motivates some directions for future work.
Among the 20 link count measurements there are, in principle, four linear dependencies corresponding to the balance between incoming and outgoing byte counts around each router and to the fact that traffic measurements between Router4 and Gateway are collected at interfaces on both routers. In reality there are inevitable discrepancies in the counts--up to 5%, for example, between total incoming and total outgoing traffic at these routers. We reconcile these inconsistencies before estimating traffic parameters.
Suppose there are n redundancy equations arising from the network topology:
a'[sub i]y[sub t] = b'[sub i]y[sub t], i = 1,...,n,
where a[sub i], b[sub i] are 0-1 vectors with a'[sub i]b[sub i] = 0. When measurements do not provide exact equality, we rescale them using the following iterative proportional adjustment method.
Without loss of generality, assume b'[sub i]y[sub t] is greater than or equal to a'[sub i]y[sub t]. Moving cyclically through i = 1,..., n, multiply the components of y[sub t] corresponding to nonzero elements of a[sub i] by the factor (b'[sub i]y[sub t])/(a'[sub i]y[sub t]). Iteration stops when (b[sub i] - a[sub i])'y[sub t] is sufficiently close to 0. Such reconciliation makes adjustments proportional to the actual link measurements, which is an intuitively fair approach. Following adjustment, we remove redundant elements from both y[sub t] and A and then locally fit the iid normal model obtaining estimates of both lambda[sub t] and x[sub t].
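A sketch of that reconciliation sweep (the function name and the six-link example are our own inventions). The example has one router-balance equation and one duplicate-measurement equation, with small discrepancies of the kind described above:

```python
import numpy as np

def reconcile(y, pairs, tol=1e-12, max_sweeps=1000):
    """Cyclically rescale the components of y on the a-side of each
    redundancy equation a'y = b'y until all equations hold."""
    y = np.asarray(y, dtype=float).copy()
    for _ in range(max_sweeps):
        worst = 0.0
        for a, b in pairs:
            worst = max(worst, abs((b - a) @ y))
            y[a > 0] *= (b @ y) / (a @ y)
        if worst < tol * np.max(y):
            break
    return y

# Hypothetical 6-link example: links 0-2 enter and links 3-5 leave one
# router, and link 2 is also measured at the far end as link 5.
balance = (np.array([1., 1., 1., 0., 0., 0.]),
           np.array([0., 0., 0., 1., 1., 1.]))
duplicate = (np.array([0., 0., 1., 0., 0., 0.]),
             np.array([0., 0., 0., 0., 0., 1.]))
y_raw = np.array([10.0, 20.0, 31.0, 15.0, 18.0, 30.0])  # ~3% discrepancies
y_adj = reconcile(y_raw, [balance, duplicate])
```

Each adjustment is proportional to the current link measurements, so large links absorb most of the correction, and components that appear only on b-sides are never rescaled.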
Estimates lambda-hat[sub t] for a recent 1-day period are shown in Figure 9 on a scale of 0 to 100K bytes/sec, which clips off some of the highest peaks in the data. Qualitatively, we expected to see lower traffic levels in the upper-left and especially the lower-right blocks of panels, corresponding to OD pairs whose endpoints sit on different routers. We also anticipated that only a few pairs would usually dominate the data flow. This tends to make the estimation problem easier, because simultaneous peaks in a single origin and a single destination byte count can be attributed to a single OD pair with relatively little ambiguity.
Computation time for this network with 64 OD pairs was 35 sec per estimation window, which compares to 6 sec for the Router1 network with 16 OD pairs. In general, the number of OD pairs can grow as fast as the square of the number of links.
Scaling these methods to work for larger problems is an area of current research. Our identifiability result implies that for a given OD pair, byte counts from only the two edge nodes are needed to estimate the intensity of traffic between them. All of the other byte counts provide supplementary information. This suggests handling large networks as a series of subproblems, each involving a given subset of edge nodes and simplifying the rest of the network by forming aggregate nodes. The two-router example of Figure 8 could represent one such subproblem for analyzing the complete Lucent network shown in Figure 1(a). We are presently studying different schemes for dividing a large problem into manageable pieces without sacrificing too much estimation accuracy.
Practical realities dictate that information needed for managing computer networks is sometimes best obtained through estimation. This is true even though exact measurements could be made by deploying specialized hardware and software. We have considered inference of origin-destination byte counts from measurements of byte counts on network links such as can be obtained from router interfaces. All commercial routers can report their link counts whereas measuring complete OD counts on a network is far from routine.
Using a real-life example, we have shown that OD counts can be recovered with good accuracy relative to the degree of ambiguity that remains after marginal and positivity constraints are met. We model the counts locally as iid normal conditional on the mean. Identifiability of the OD parameters from link data requires a relation between means and variances of OD counts. A power-law family provides a model that is a natural extension of the normal approximation to the Poisson and incorporates extra-Poisson variation observed in sample data.
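The power-law diagnostic of Figure 4 can be checked on any set of local moment estimates by regressing log variance on log mean. The sketch below uses synthetic stand-in data (the real link measurements are not reproduced here); the exponent c is recovered as the fitted slope.

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic stand-in for the Figure 4 diagnostic: local variances s_t^2
# generated from a quadratic power law var = phi * mean^c with c = 2,
# then c recovered as the slope of log s_t^2 on log m_t.
m = rng.uniform(1e3, 1e5, size=200)                   # local means
s2 = 0.5 * m**2 * rng.lognormal(0.0, 0.3, size=200)   # noisy quadratic law
c_hat, log_phi = np.polyfit(np.log(m), np.log(s2), 1)
# c_hat near 2 supports the quadratic power law; c_hat near 1 would
# correspond to Poisson-like variation
```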
Simple local likelihood fitting of an iid model is not sufficient because large fitting windows smooth sharp changes in OD traffic, yet small windows cause estimates to be unreliable. A refinement in which the logs of positive parameters are modeled as random walks penalizes the local likelihood surface enough to induce smoothness in parameter estimates while not unduly compromising their ability to conform to sharp changes in traffic. Section 4 developed a fully normal approximation to this approach and demonstrated how effectively it recovers OD byte counts for our Router1 network.
There are several possible directions for further work. First and most important is developing an efficient algorithm for large networks, as discussed briefly in Section 6. Second, we would like to study how accuracy of traffic estimates for different OD pairs is affected by the network topology. Third, we would like to fold measurement errors into the data model rather than reconciling them up front and then proceeding as if there were no such errors. Additional topics are cross-validation for window sizes and prior variances, modeling changes in lambda[sub t] with more realism than the random walk, and replacing the normal distribution with a heavy-tailed positive distribution.
Progress in these areas would be especially helpful for analyzing large networks but the basic approach we have outlined is still appropriate. Our methods can easily be used on portions of networks with a small-to-moderate number of edge nodes. Giving LAN administrators access to OD traffic estimates provides them with a much more direct understanding of the sizes and timing of traffic flows through their networks. This is an enormous help for network management, planning, and pricing.
DIAGRAM: Figure 1. Two Computer Networks. (a) A router network at Lucent Technologies; (b) the network around Router 1.
GRAPH: Figure 2. Link Measurements for the Router1 Subnetwork From Figure 1(b). The average number of bytes per second is measured over consecutive 5-minute intervals for the 24-hour period of February 22, 1999. Each panel represents a node. Origin (magenta) and destination (cyan) series are superposed. Matching patterns suggest traffic flows. For example, at hour 12, the origin corp pattern matches that for destination switch.
GRAPHS: Figure 3. IID Normal Assumptions. Left: Normal probability plot of locally standardized residuals from the origin switch link measurements shown previously. The distribution has a somewhat heavier tail than the normal. Right: Time-series plots of locally standardized residuals for the same origin switch link.
GRAPH: Figure 4. Local Variances, s[sup 2, sub t], Versus Local Means, m[sub t], for Router1 Link Measurements. Each panel has 8 points, one for each source and destination link. The solid line is a linear fit to the points; the dashed lines have slopes c = 1 and c = 2 corresponding to different power law relations. The quadratic power law, c = 2, is a better fit.
GRAPH: Figure 5. Mean Traffic Estimates, lambda[sub t], for all OD Pairs. Marginal panels on the top and right compare model-based estimates of the mean link traffic E(x[sub t]) = A lambda[sub t] (cyan) with moving averages of the observed link measurements (black).
GRAPH: Figure 6. Comparison of Mean OD Traffic Estimates lambda[sub t] Obtained From the Original Local Model (cyan) and the Refined Model (magenta). Estimates from the refined model, incorporating an adaptive prior, are smoother, especially near zero.
GRAPH: Figure 7. Validation Measurements Versus Estimates. Measurements x[sub t] (black) are plotted over the fitted estimates (magenta). The resolution on the vertical scale highlights artifacts of the estimation procedure that occur especially when the actual traffic is near zero. The panel for fddi->corp in the lower right is particularly bad. The marginal plots of the measurements and fits match exactly, as required by the fitting algorithm.
DIAGRAM: Figure 8. Two-Router Network at Lucent Technologies
GRAPH: Figure 9. Traffic Estimates, x[sub t], for all OD pairs in the Two-Router Network. The largest spikes have been clipped by the vertical scale. Major features of the estimates are consistent with usage.
Anderson, B. D. O., and Moore, J. B. (1979), Optimal Filtering, Englewood Cliffs, NJ: Prentice-Hall.
Chambers, J. M., and Hastie, T. J., (eds.) (1993), Statistical Models in S, Pacific Grove, CA: Chapman & Hall.
Cooperative Association for Internet Data Analysis (CAIDA) (1999), Cflowd Flow Analysis Software (version 2.0), www.caida.org/Tools/Cflowd.
Csiszar, I. (1975), "I-Divergence Geometry of Probability Distributions and Minimization Problems," The Annals of Probability, 3, 146-158.
Deming, W. E., and Stephan, F. F. (1940), "On a Least Squares Adjustment of a Sampled Frequency Table When the Expected Marginal Totals are Known," Annals of Mathematical Statistics, 11, 427-444.
Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977), "Maximum Likelihood From Incomplete Data via the EM Algorithm, With Discussion," Journal of the Royal Statistical Society, Ser. B, 39, 1-38.
Gay, D. M. (1983), "Algorithm 611: Subroutines for Unconstrained Minimization Using a Model/Trust-Region Approach," ACM Trans. Math. Software, 9, 503-524.
Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (1995), Bayesian Data Analysis, Chapman & Hall.
Hastie, T. J., and Tibshirani, R. J. (1990), Generalized Additive Models, London: Chapman & Hall.
Ireland, C. T., and Kullback, S. (1968), "Contingency Tables With Given Marginals," Biometrika, 55, 179-188.
Ku, H. H., and Kullback, S. (1968), "Interaction in Multidimensional Contingency Tables: An Information Theoretic Approach," Journal of Research of the National Bureau of Standards, 72, 159-199.
Lange, K. (1995), "A Gradient Algorithm Locally Equivalent to the EM Algorithm," Journal of the Royal Statistical Society, Ser. B, 57, 425-437.
Loader, C. (1999), Local Regression and Likelihood, New York: Springer-Verlag.
Tebaldi, C., and West, M. (1998), "Bayesian Inference on Network Traffic Using Link Count Data," Journal of the American Statistical Association, 93(442), 557-576.
Vanderbei, R. J., and Iannone, J. (1994), "An EM Approach to OD Matrix Estimation," Technical Report SOR 94-04, Princeton University.
Vardi, Y. (1996), "Network Tomography: Estimating Source-Destination Traffic Intensities From Link Data," Journal of the American Statistical Association, 91, 365-377.
West, M., and Harrison, J. (1997), Bayesian Forecasting and Dynamic Models, New York: Springer-Verlag.
Proof of Theorem 1
From the basic model (2)-(4), it is easy to see that given two parameter sets (lambda, phi) and (lambda~, phi~), the model is identifiable if and only if the conditions
(A.1) A lambda = A lambda~,
phi A diag(lambda[sup c])A' = phi~ A diag(lambda~[sup c])A'
imply the equalities lambda = lambda~ and phi = phi~.
Note that B has full column rank if and only if A diag(x)A' = 0 has only the solution x = 0. Thus if B has full column rank, the second condition of (A.1) implies phi lambda[sup c] = phi~ lambda~[sup c], or equivalently, a lambda = lambda~, where a[sup c] = phi/phi~ > 0. Substituting this into the first condition of (A.1) gives
A lambda - A lambda~ = (1 - a)A lambda = 0.
Let 1 be a column vector of ones of length d and note that 1'A gives the column sums of A. Premultiplying both sides of the previous equation by 1' gives (1 - a)(1'A)lambda = 0. If B has full column rank, it follows that the column sums of A are all positive. Therefore, if lambda[sub i] is greater than or equal to 0 with at least one strict inequality, we have a = 1, implying both lambda = lambda~ and phi = phi~. On the other hand, if B does not have full column rank, then A diag(x)A' = 0 has a nonzero solution, and thus condition (A.1) with phi = phi~ does not imply lambda = lambda~.
Proof of Corollary
We shall show that the matrix B has full column rank in this case. Note that A is a d x I matrix of 0s and 1s whose rows represent how the link variables, y, are summed up from the OD variables x. For the ith OD pair, we focus on two link interfaces--one corresponds to the first interface on which traffic from the origin enters the network, and the other to the last interface before the traffic leaves the network for the destination. Suppose these two links correspond to two rows r[sub 1] and r[sub 2] of A. Multiplying them component-wise produces a row vector of B. Moreover, row r[sub 1] has 0 entries everywhere except for the OD pairs whose origin matches that of the ith OD pair. Similarly, r[sub 2] has 0 entries everywhere except for the OD pairs whose destination matches that of the ith OD pair. It follows that the only component where they share a 1 is the ith. Hence the product vector has entries 0 everywhere except for the ith component, where it has entry 1. Thus the rows of B formed from such component-wise products contain an I x I identity matrix, and B has exactly I columns. Hence B has full column rank.
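The construction is easy to verify numerically for a small star network. The routing matrix below is a plausible instance of the setting described, not one taken from the article's networks: two edge nodes, I = 4 OD pairs, and B formed from component-wise products of pairs of rows of A.

```python
import numpy as np
from itertools import combinations_with_replacement

# Routing matrix A (d x I, 0s and 1s) for a one-router network with two
# edge nodes; columns are the I = 4 OD pairs (1,1), (1,2), (2,1), (2,2).
A = np.array([
    [1, 1, 0, 0],   # origin link of node 1
    [0, 0, 1, 1],   # origin link of node 2
    [1, 0, 1, 0],   # destination link of node 1
    [0, 1, 0, 1],   # destination link of node 2
])

# Rows of B are component-wise products of pairs of rows of A
# (taking i = j reproduces the rows of A themselves).
B = np.array([A[i] * A[j]
              for i, j in combinations_with_replacement(range(len(A)), 2)])
rank = int(np.linalg.matrix_rank(B))   # equals I = 4: full column rank
```

Among the product rows are the four unit vectors (origin row times destination row for each OD pair), which is exactly the identity submatrix the proof exhibits.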
[Received July 1999. Revised June 2000.]
~~~~~~~~
By Jin Cao, Drew Davis, Scott Vander Wiel, and Bin Yu
Jin Cao, Drew Davis, and Scott Vander Wiel are Members of Technical Staff at Bell Laboratories, Lucent Technologies. Bin Yu is Member of Technical Staff at Bell Laboratories, Lucent Technologies, and Associate Professor of Statistics at University of California at Berkeley. We thank Tom Limoncelli and Lookman Fazal for system administration support in setting up MRTG and Netflow data collection. We also thank Debasis Mitra for pointing us to this problem and Yehuda Vardi and Mor Armony for engaging discussions. (E-mail: cao@research.bell-labs.com).
Title: | P Values for Composite Null Models. |
AN: | 3851439 |
ISSN: | 0162-1459 |
Database: | Business Source Premier
The problem of investigating compatibility of an assumed model with the data is investigated in the situation when the assumed model has unknown parameters. The most frequently used measures of compatibility are p values, based on statistics T for which large values are deemed to indicate incompatibility of the data and the model. When the null model has unknown parameters, p values are not uniquely defined. The proposals for computing a p value in such a situation include the plug-in and similar p values on the frequentist side, and the predictive and posterior predictive p values on the Bayesian side. We propose two alternatives, the conditional predictive p value and the partial posterior predictive p value, and indicate their advantages from both Bayesian and frequentist perspectives.
KEY WORDS: Bayes factors; Bayesian p values; Conditioning; Model checking; Predictive distributions.
1.1 Background
In parametric statistical analysis of data X, one is frequently working at a given moment with an entertained model or hypothesis H[sub 0]: X ~ f(x; theta). We will call this the null model or null hypothesis, even though no alternative is explicitly formulated. We assume that f(x; theta) is either a discrete density or a continuous density (with respect to Lebesgue measure). A statistic T = t(X) is chosen to investigate compatibility of the model with the observed data, x[sub obs]. We assume that T has been expressed in such a way that large values of T indicate less compatibility with the model. The most commonly used measure of compatibility is the p value, defined as
(1) p = Pr(t(X) is greater than or equal to t(x[sub obs])).
When theta is known, the probability computation in (1) is with respect to f(x; theta). The focus in this article is on the choice of the probability distribution used to compute (1) when theta is unknown. In Section 2 we present two new types of p values, which we argue are superior to existing choices. The rest of this section describes the most common of the existing choices. We abuse notation by using f(t; theta) and f(t|u; theta) to denote the marginal density of t(X) and the conditional density of t(X) given u(X) = u.
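As a concrete illustration of (1) when theta is known, the tail area can be approximated by simulating replicate datasets from the fully specified null. The toy data and statistic below are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(4)

def p_value(x_obs, t, sampler, n_sim=50000):
    """Monte Carlo version of (1): the chance, under the fully specified
    null, that T meets or exceeds its observed value."""
    t_obs = t(x_obs)
    return float(np.mean([t(sampler()) >= t_obs for _ in range(n_sim)]))

# toy null X_1..X_5 iid N(0, 1), departure statistic t(X) = |Xbar|
x_obs = np.array([1.2, 0.4, -0.3, 1.8, 0.9])
p = p_value(x_obs, lambda x: abs(x.mean()),
            lambda: rng.normal(0.0, 1.0, size=5))
# agrees, up to Monte Carlo error, with the exact two-sided tail area
# 2[1 - Phi(sqrt(5)|xbar_obs|)]
```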
The most obvious way to deal with an unknown theta in computation of the p value is to replace theta in (1) by some estimate, theta-hat. In this article we consider only the usual choice for theta-hat, namely the maximum likelihood estimator (MLE). We call the resulting p value the plug-in p value (p[sub plug]). Using a superscript to denote the density with respect to which the p value in (1) is computed, the plug-in p value is thus defined as
(2) p[sub plug] = Pr[sup f(x; theta-hat)] (t(X) is greater than or equal to t(x[sub obs])).
The main strengths of p[sub plug] are its simplicity and intuitive appeal. Its main weakness appears to be a failure to account for uncertainty in the estimation of theta, although as we show, this issue is rather involved.
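A Monte Carlo sketch of p[sub plug]: the MLE is computed once and then treated as the true parameter when simulating replicate datasets. The Poisson dispersion check and the data below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def p_plug(x_obs, t, sampler_at_mle, n_sim=5000):
    """Monte Carlo plug-in p value (2): replicates are drawn from
    f(x; theta-hat), with the MLE plugged in as if it were true."""
    t_obs = t(x_obs)
    return float(np.mean([t(sampler_at_mle()) >= t_obs
                          for _ in range(n_sim)]))

# checking a Poisson null with the dispersion statistic t = s^2 / xbar
x_obs = np.array([0, 1, 1, 2, 9, 0, 1, 12, 0, 2])   # overdispersed-looking
lam_hat = x_obs.mean()                              # Poisson MLE
t = lambda x: x.var() / max(x.mean(), 1e-12)
p = p_plug(x_obs, t, lambda: rng.poisson(lam_hat, size=x_obs.size))
# a small p flags overdispersion relative to the Poisson null
```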
Another natural device for eliminating the unknown theta is to condition on a sufficient statistic, U, for theta. Then f(x|u[sub obs]; theta) does not depend on theta, and computations in (1) can be carried out using the completely specified f(x|u[sub obs]). [In fact, U need only be sufficient for theta with respect to f(t;theta).] We call these p values similar p values, a term borrowed from the related notion of similar tests and confidence regions. A similar p value is thus defined as
(3) p[sub sim] = Pr[sup f(x|u[sub obs])] (t(X) is greater than or equal to t(x[sub obs])).
The main strength of p[sub sim] is that it is based on a proper probability computation, which imbues the end result with various desirable properties (discussed later). Its main weaknesses are that the computation can be burdensome and that a suitable sufficient U typically does not exist.
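When a suitable sufficient statistic does exist, p[sub sim] is an exact tail area. A sketch for an assumed example: X[sub i] iid N(theta, 1) with theta unknown, t(X) = s[sup 2] checking the unit-variance assumption, and U = Xbar sufficient for theta; given Xbar, (n - 1)s[sup 2] has a chi-squared distribution with n - 1 degrees of freedom, free of theta.

```python
import numpy as np

rng = np.random.default_rng(3)

# made-up data for the assumed N(theta, 1) null
x_obs = np.array([0.2, 2.1, -1.9, 3.0, -2.4, 1.7])
n = x_obs.size
t_obs = x_obs.var(ddof=1)                # sample variance s^2

# p_sim = Pr(s^2 >= t_obs | Xbar), evaluated by Monte Carlo from the
# chi-squared(n - 1) distribution of (n - 1)s^2, which is free of theta
draws = rng.chisquare(n - 1, size=100000) / (n - 1)
p_sim = float(np.mean(draws >= t_obs))   # small: variance looks above 1
```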
Bayesians have a natural way to eliminate nuisance parameters: integrate them out. Thus if pi(theta) is a prior distribution for theta, then the marginal or (prior)predictive distribution is
(4) m(x) = Integral of f(x; theta) pi (theta) d theta.
Because this is free of theta, it can be used to compute a p value, leading to the prior predictive p value, given by
(5) p[sub prior] = Pr[sup m(x)](t(X) is greater than or equal to t(x[sub obs])).
The main strengths of p[sub prior] are that it is also based on a proper probability computation (at least if pi(theta) is proper), and that it suggests a natural and simple T, namely t(x) = 1/m(x). The main weakness of p[sub prior] for pure model checking is its dependence on the prior pi(theta); in essence, m(x) measures the likelihood of x relative to both the model and the prior, and an excellent model could come under suspicion if a poor prior distribution were used. For this reason, and because model checking is often considered at early stages of an analysis before careful prior elicitation is performed (and/or because a nonsubjective analysis might be desired from the beginning), it is attractive to attempt to use noninformative priors. Unfortunately, noninformative priors are typically improper, in which case the prior predictive m(x) would also be improper, precluding computation of (5). Box (1980) popularized the use of p[sub prior].
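p[sub prior] in (5) can be approximated by composition sampling: draw theta from the (proper) prior, then a replicate dataset from the model, so the tail area is computed under m(x) in (4). The Gamma prior and the Poisson dispersion check below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def p_prior(x_obs, t, prior_sampler, model_sampler, n_sim=5000):
    """Monte Carlo prior predictive p value (5): each replicate draws
    theta from the prior, then a dataset from f(x; theta)."""
    t_obs = t(x_obs)
    hits = 0
    for _ in range(n_sim):
        theta = prior_sampler()
        hits += t(model_sampler(theta)) >= t_obs
    return hits / n_sim

# Poisson null with a made-up proper Gamma(2, 1) prior on the mean;
# t is the dispersion statistic s^2 / xbar
x_obs = np.array([0, 1, 1, 2, 9, 0, 1, 12, 0, 2])
t = lambda x: x.var() / max(x.mean(), 1e-12)
p = p_prior(x_obs, t,
            prior_sampler=lambda: rng.gamma(2.0, 1.0),
            model_sampler=lambda lam: rng.poisson(lam, x_obs.size))
```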
The concerns mentioned in the preceding paragraph have led many Bayesians, beginning with Guttman (1967) and Rubin (1984), to eliminate theta from f(x; theta) by integrating with respect to the posterior distribution, pi(theta|x[sub obs]), instead of the prior, before computing a p value. The posterior predictive p value is thus defined as
(6) p[sub post] = Pr[sup m[sub post](x|x[sub obs])] (t(X) is greater than or equal to t(x[sub obs])),
where
(7) m[sub post](x|x[sub obs]) = Integral of f(x; theta) pi (theta|x[sub obs])d theta.
The main strengths of p[sub post] are as follows:
a. Improper noninformative priors can readily be used (since pi(theta|x[sub obs]) will typically be proper).
b. m[sub post](x|x[sub obs]) typically will be much more heavily influenced by the model than by the prior; indeed, as the sample size goes to infinity, the posterior distribution will essentially concentrate at theta-hat, so that p[sub post] will (for large n) be very close to p[sub plug].
c. It typically is very easy to compute using output from modern Markov chain Monte Carlo (MCMC) Bayesian analyses.
Its main weakness is that there is an apparent "double use" of the data in (6), first to convert the (possibly improper) prior pi(theta) into a proper distribution pi(theta|x[sub obs]) for determining the reference distribution m[sub post](x|x[sub obs]), and then to compute the tail area corresponding to t(x[sub obs]). This double use of the data can induce unnatural behavior. From a Bayesian perspective, defenders of the prior predictive also point out that the posterior predictive lacks a pure Bayesian interpretation; although this was our original motivation for the developments herein, the arguments in the article are not directly based on such reasoning.
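Computationally, p[sub post] differs from a prior predictive calculation only in where theta is drawn from. In conjugate settings no MCMC is even needed; the Poisson-Gamma example below, with made-up data and hyperparameters, illustrates the point.

```python
import numpy as np

rng = np.random.default_rng(2)

def p_post(x_obs, t, posterior_sampler, model_sampler, n_sim=5000):
    """Monte Carlo posterior predictive p value (6): each replicate draws
    theta from pi(theta | x_obs), then a dataset from f(x; theta)."""
    t_obs = t(x_obs)
    reps = [t(model_sampler(posterior_sampler())) for _ in range(n_sim)]
    return float(np.mean(np.array(reps) >= t_obs))

# Poisson null; with a Gamma(a0, b0) prior the posterior for the mean is
# Gamma(a0 + sum x, b0 + n) by conjugacy
x_obs = np.array([0, 1, 1, 2, 9, 0, 1, 12, 0, 2])
a0, b0 = 2.0, 1.0                          # made-up prior hyperparameters
a_n, b_n = a0 + x_obs.sum(), b0 + x_obs.size
t = lambda x: x.var() / max(x.mean(), 1e-12)
p = p_post(x_obs, t,
           posterior_sampler=lambda: rng.gamma(a_n, 1.0 / b_n),
           model_sampler=lambda lam: rng.poisson(lam, x_obs.size))
```

Note the double use of x[sub obs]: it determines the posterior parameters (a_n, b_n) and then reappears in the tail-area comparison.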
Generalizations of (6) were considered by Meng (1994), Gelman, Carlin, Stern, and Rubin (1995), Gelman, Meng, and Stern (1996), and references therein; in particular, t(X) could be replaced by a function t(X, theta), and f(x; theta) in (7) could be replaced by f(x|theta, A), where A is some other statistic. We do not discuss such generalizations in this article.
There are also many other related works. Aitkin (1991) used the posterior distribution to compute actual Bayes factors, instead of p values. Evans (1997) introduced a related concept for model checking based on the ratio of the posterior and prior predictive densities.
Other approaches that have been suggested for dealing with the nuisance parameter, theta, in computing (1) include those of Tsui and Weerahandi (1989) (primarily for one-sided testing) and Berger and Boos (1994). The latter authors sought to provide a practical implementation of the conservative frequentist approach that deals with unknown theta by maximization:
p[sub sup] = sup[sub theta] Pr[sup f(x; theta)] (t(X) is greater than or equal to t(x[sub obs])).
This p value is of rather limited usefulness, because the supremum is often too large to provide useful criticism of the model. For instance, in the examples of Sections 2 and 3, p[sub sup] can easily be seen to equal 1. Berger and Boos (1994) overcame this difficulty by restricting the supremum to theta in a confidence set for theta (with the noncoverage probability being added to the p value). Although potentially useful to frequentists in formal testing situations, in which conservatism is typically deemed desirable, the approach is less appropriate for model checking in which conservatism would mean that one is often not alerted to the fact that the model is inadequate.
1.2 Evaluation of p Values
What do we want in a p value? For a frequentist, one appealing property would be for p, considered as a random variable, to be uniform[0, 1] under the null, f(x; theta), for all theta. In some sense, being U[0, 1] defines a proper p value, allowing for its common interpretation across problems. Statistical measures that lack a common interpretation across problems are simply not very useful. (For more extensive discussion of this point, see the companion article Robins, van der Vaart, and Ventura 2000, which we henceforth denote by RVV 2000; earlier articles that refer to and/or discuss this "defining" property of a p value include De la Horra and Rodriguez-Bernal 1997, Meng 1994, Robins 1999, Rubin 1996, and Thompson 1997.)
For most problems, exact uniformity under the null for all theta cannot be attained for any p value. Thus one must weaken the requirement to some extent. A natural weaker requirement is that a p value be U[0, 1] under the null in an asymptotic sense; this is the subject of RVV (2000). Here we focus on studying the degree to which the various p values deviate from uniformity in finite-sample scenarios.
It is not obvious that Bayesians should be concerned with establishing that a p value is uniform under the null for all theta. For instance, the prior predictive p value is U[0, 1] under m(x) (if the prior is proper), which means that it is U[0, 1] in an average sense over theta. If the prior distribution is chosen subjectively, then a Bayesian could well argue that this is sufficient; indeed, Meng (1994) suggested that uniformity under m(x) is a useful criterion for the evaluation of any proposed p value. (The more basic issue that a p value is a tail area, and not compatible with true Bayesian measures, is discussed briefly in the next section.)
As mentioned earlier, however, preliminary model checking is most typically done (by Bayesians) with noninformative priors, and if these are improper, there is no "average over theta" that can be used. (We later give an example with a proper noninformative prior, in which a Bayesian--or non-Bayesian--might settle for "average" uniformity.) Of course, if a p value is uniform under the null in the frequentist sense, then it has the strong Bayesian property of being marginally U[0, 1] under any proper prior distribution. This explains why Bayesians should, at least, be highly satisfied if the frequentist requirement obtains. Perhaps more to the point, if a proposed p value is always either conservative or anticonservative in a frequentist sense (see RVV 2000 for definitions), then it is likewise guaranteed to be conservative or anticonservative in a Bayesian sense, no matter what the prior. A similar conclusion would hold for large sample sizes (under mild conditions) if a proposed p value were always conservative or anticonservative in a frequentist asymptotic sense. (Interesting related discussions concerning the posterior predictive p value have been given by Gelman et al. 1996, Meng 1994, and Rubin 1996.)
Actually, Bayesians might well go further, not only requiring unconditional uniformity for p values, but also seeking reasonable conditional performance. In this article we limit discussion of this issue to presentation of some examples in which it is clear that study of conditional performance is of value in comparing p values; we do not, however, attempt to present general results in this direction.
There is a vast literature on other methods of evaluating p values. Much of the literature is concerned with power comparisons against alternatives. There is also a significant literature concerned with decision-theoretic evaluations of p values (e.g., Blyth and Staudte 1995; Hwang, Casella, Robert, Wells, and Farrell 1992; Hwang and Pemantle 1997; Hwang and Yang 1997; Schaafsma, Tolboom, and Van Der Meulen 1989; Thompson 1997). Neither of these evaluation techniques is within the scope of this article, because we are specifically concerned with the situation in which no alternative is present (see the next section). From a non-Bayesian perspective, however, evaluation of the new p values by these criteria might well prove very illuminating; see RVV (2000) for interesting results in this direction.
1.3 To Be and Not To Be
This article has five sections. In Section 2 we consider two new p values introduced by Bayarri and Berger (1999), the partial posterior predictive p value (p[sub ppost]) and the conditional predictive p value (p[sub cpred]), and compare them with previous p values in specific examples. Some results are also given in Section 2 concerning equality of various p values; of particular interest is a result (Theorem 2) that allows ready computation of p[sub sim] in certain situations. In Section 3 we compare the various p values in the normal linear model, where exact computation is possible; this section was directly motivated by RVV (2000). In Section 4 we discuss the situation of discrete sample spaces, with emphasis on analysis of contingency tables; this has long been a highly problematic area, with the discreteness of the sample space causing many p values to be very conservative. We present conclusions in Section 5.
A number of relevant issues are not considered in this article. First, we do not explicitly discuss the choice of T, in part because this is a contentious issue and is not directly related to our development; one of the strengths of the methodology that we propose is that it can be applied to essentially any choice of T. Second, our primary focus is on model checking at initial, exploratory stages of the statistical analysis, and consideration of a wide variety of intuitive T is often useful at that stage. If one has a clearly formulated alternative to H[sub 0], then we would not recommend using any p value to perform the test, and would instead use either Bayes factors or conditional frequentist tests (Bayarri and Berger 1999; Berger, Boukai, and Wang 1997; Berger, Brown, and Wolpert 1994; Berger and Delampady 1987; Berger and Sellke 1987; Delampady and Berger 1990; Edwards, Lindman, and Savage 1963). The decision to even formulate an alternative to H[sub 0], however, is often undertaken on the basis of an analysis designed to indicate incompatibility of the model with the data, based on intuitive departure statistics T = t(x). If determination of T required hard work, then we would suggest spending the time instead on actual formulation of the alternative.
The other major issue that we mostly avoid is discussion of measures other than p values of data compatibility with the model. (For a review of a number of other measures that have been proposed, see Bayarri and Berger 1997.) The primary reason for considering only p values is their ubiquitous presence in statistics, together with the fact that they do have some desirable properties (such as invariance to transformations of X). Balanced against this is the near ubiquitous misinterpretation of p values as either frequentist error probabilities or (worse) as the probability of H[sub 0]. Luckily, a rather simple calibration is available that allows p values to be given an intuitive interpretation: compute B(p) = -ep log(p), when p < e[sup -1], and interpret this as the odds (or Bayes factor) of H[sub 0] to H[sub 1], where H[sub 1] denotes the (unspecified) alternative to H[sub 0]. For those who prefer to think in terms of a frequentist error probability alpha (in rejecting H[sub 0]), the calibration is alpha(p) = (1 + [-ep log(p)][sup -1])[sup -1]. As an example, p = .05 translates into odds B(.05) = .41 (roughly 1 to 2.5) of H[sub 0] to H[sub 1], and frequentist error probability alpha(.05) = .29 in rejecting H[sub 0].
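The two calibrations are direct to compute; the sketch below reproduces the numbers quoted for p = .05 (the function names are ours, not the article's).

```python
import math

def bayes_factor_bound(p):
    """B(p) = -e p log(p), for p < 1/e: a lower bound on the Bayes
    factor (odds) of H0 to the unspecified alternative H1."""
    return -math.e * p * math.log(p)

def type1_error_bound(p):
    """alpha(p) = (1 + [-e p log(p)]^(-1))^(-1): the corresponding
    lower bound on the conditional frequentist type I error."""
    return 1.0 / (1.0 + 1.0 / bayes_factor_bound(p))

b = round(bayes_factor_bound(0.05), 2)   # 0.41, i.e., odds of about 1 to 2.5
a = round(type1_error_bound(0.05), 2)    # 0.29
```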
These calibrations were developed and motivated from various perspectives by Sellke, Bayarri, and Berger (1999). On the Bayesian side, they arise from robust Bayesian arguments, as lower bounds on Bayes factors for testing H[sub 0]. [B(p) arises exactly in testing against a general nonparametric alternative, and arises approximately in parametric analyses.] On the frequentist side, alpha(p) arises as a lower bound on the type I error probability, over a large class of conditional frequentist tests, where one conditions on the "strength of evidence" in the data. It is of interest that the calibrations are based in part on starting with a proper p value; that is, a p value that is U[0, 1] in some sense.
Further discussion of some of the philosophical issues surrounding the use of p values has been given by Bayarri and Berger (1999). From now on, we ignore these issues and simply assume that (possibly calibrated) p values are useful, for whatever reasons, and focus on the issue of which p values are most satisfactory.
Section 2.1 introduces the two new p values that we consider and illustrates their definitions (and those of the other p values) in a standard example; some interesting features of the various p values are also observed. Section 2.2 presents the motivations for the new p values from both Bayesian and frequentist perspectives. Section 2.3 addresses computational issues.
2.1 Methodology
Consider first the partial posterior predictive p value, defined for a prior pi(theta) (typically noninformative) as
(8) p[sub ppost] = Pr[sup m(t|x[sub obs]\t[sub obs])] (T is greater than or equal to t[sub obs]);
here T = t(X), t[sub obs] = t(x[sub obs]), and m(t|x[sub obs]\t[sub obs]) and the (assumed proper) partial posterior pi(theta|x[sub obs]\t[sub obs]) are given by
m(t|x[sub obs]\t[sub obs]) = Integral of f(t|theta)pi(theta|x[sub obs]\t[sub obs])d theta
and
(9) pi(theta|x[sub obs]\t[sub obs]) proportional to f(x[sub obs]|t[sub obs]; theta) pi (theta) proportional to f(x[sub obs]; theta)pi(theta)/f(t[sub obs]; theta).
Intuitively, this avoids the double use of the data that occurs in the posterior predictive p value, because the contribution of t[sub obs] to the posterior is "removed" before theta is eliminated by integration. (The notation x[sub obs]\t[sub obs] was chosen to indicate this.)
The second p value that we propose is a specific case of what can be termed a U-conditional predictive p value, defined, for some conditioning statistic U = u(X), as
(10) p[sub cpred(u)] = Pr[sup m(t|u[sub obs])] (T is greater than or equal to t[sub obs]);
here u[sub obs] = u(x[sub obs]) and (formally)
(11) m(t|u) = Integral of f(t|u; theta) pi (theta|u) d theta,
assuming that
(12) pi(theta|u) = f(u; theta)pi(theta)/ Integral of f(u; theta)pi(theta)d theta
is proper. [Recall that f(t[u; theta) and f(u; theta) are defined as the conditional and marginal densities of T and U under H[sub 0].]
The specific proposal that we recommend, for the case of continuous data, is obtained by choosing U in (10) to be the conditional MLE of $\theta$ given $t(x) = t$, defined as

(13) $\hat\theta_{\mathrm{cMLE}}(x) = \arg\max_{\theta} f(x\mid t;\theta) = \arg\max_{\theta} \frac{f(x;\theta)}{f(t;\theta)}.$

We suppress $\hat\theta_{\mathrm{cMLE}}$ in the notation and call the resulting p value simply the conditional predictive p value, denoted by $p_{\mathrm{cpred}} = p_{\mathrm{cpred}(\hat\theta_{\mathrm{cMLE}})}$. Note that $m(t\mid u)$ is unaffected by one-to-one transformations of $u(x)$, so that any one-to-one transformation of (13) is satisfactory as the choice of $\hat\theta_{\mathrm{cMLE}}$.
Note that when T is conditionally independent of $\hat\theta_{\mathrm{cMLE}}$ and $(T, \hat\theta_{\mathrm{cMLE}})$ are jointly sufficient, both of the foregoing proposals agree; that is, $p_{\mathrm{ppost}} = p_{\mathrm{cpred}}$. This occurs in the following example from Meng (1994), which we use to exhibit the various p values defined so far.
Example 1. Assume that under the null the $X_i$ are iid $N(0, \sigma^2)$, with $\sigma^2$ unknown. The statistic $t(X) = |\bar{X}|$ is chosen to measure departure from the model (which would be natural for detecting a discrepancy in the mean of the model). The various p values are given by

(14) $p = \Pr\{|\bar{X}| \ge |\bar{x}_{\mathrm{obs}}|\},$

with different distributions used to compute the probability. For the Bayesian p values, we utilize the usual noninformative prior for $\sigma^2$: $\pi(\sigma^2) \propto 1/\sigma^2$. Finally, define $s^2 = \sum(x_i - \bar{x})^2/n$.
$p_{\mathrm{plug}}$: Because $\bar{X} \sim N(0, \sigma^2/n)$ and $\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2 = s^2 + \bar{x}^2$ is the MLE, it follows from (2) and (14) that

(15) $p_{\mathrm{plug}} = 2\left[1 - \Phi\!\left(\frac{\sqrt{n}\,|\bar{x}_{\mathrm{obs}}|}{\sqrt{s^2_{\mathrm{obs}} + \bar{x}^2_{\mathrm{obs}}}}\right)\right].$
One obvious inadequacy of this p value is that $p_{\mathrm{plug}} \to 2[1 - \Phi(\sqrt{n})]$, a positive constant, as $|\bar{x}_{\mathrm{obs}}|/s_{\mathrm{obs}} \to \infty$. Thus, even with arbitrarily strong evidence against the null model, the p value will not go to 0 (for fixed n). For large n this limiting constant is small, so it would not pose a practical problem; in practice, however, the number of observations is often not large in comparison to the number of parameters, so that concerns of this type can be relevant. In any case, such behavior is indicative of a fundamental flaw in the procedure. (In this example, one could achieve more satisfactory results by plugging in $s_{\mathrm{obs}}$ rather than the MLE; indeed, this is related to the "conditional plug-in" p value, which, however, is shown in RVV 2000 to have deficiencies as well.)
$p_{\mathrm{sim}}$: A sufficient statistic for $\sigma^2$ is $V = \sum_{i=1}^{n} X_i^2 = \|X\|^2$. The distribution of X given $v_{\mathrm{obs}} = \|x_{\mathrm{obs}}\|^2$ is uniform on $\{x : \|x\|^2 = \|x_{\mathrm{obs}}\|^2\}$, so that (3) and (14) yield

(16) $p_{\mathrm{sim}} = \Pr\left(|\bar{X}|/\|x_{\mathrm{obs}}\| \ge |\bar{x}_{\mathrm{obs}}|/\|x_{\mathrm{obs}}\|\right) = \Pr\left(|\bar{Z}| \ge |\bar{x}_{\mathrm{obs}}|/\|x_{\mathrm{obs}}\|\right),$

where Z has a uniform distribution on $\{z : \|z\|^2 = 1\}$. Although this might appear difficult to compute, it is shown later (using Theorem 2) that $p_{\mathrm{sim}}$ is exactly equal to $p_{\mathrm{ppost}}$ and $p_{\mathrm{cpred}}$, which in turn are equal to the classical p value for the problem, given in (18); we found this result surprising.
$p_{\mathrm{prior}}$: The prior predictive p value cannot be computed for this example, because the prior distribution is improper.
$p_{\mathrm{post}}$: The posterior density $\pi(\sigma^2\mid x_{\mathrm{obs}})$ is $\mathrm{Ga}^{-1}\left(n/2,\ n(s^2_{\mathrm{obs}} + \bar{x}^2_{\mathrm{obs}})/2\right)$, and the posterior predictive distribution of $\bar{X}$ is $m_{\mathrm{post}}(\bar{x}\mid x_{\mathrm{obs}}) = t_n\!\left(\bar{x}\mid 0,\ \frac{1}{n}(s^2_{\mathrm{obs}} + \bar{x}^2_{\mathrm{obs}})\right)$; here $\mathrm{Ga}^{-1}$ and $t_r$ denote the inverse gamma distribution and the t distribution with r degrees of freedom. From (6) and (14), it follows that

(17) $p_{\mathrm{post}} = 2\left[1 - Y_n\!\left(\frac{\sqrt{n}\,|\bar{x}_{\mathrm{obs}}|}{\sqrt{s^2_{\mathrm{obs}} + \bar{x}^2_{\mathrm{obs}}}}\right)\right],$

where $Y_n$ represents the distribution function of the t distribution with n degrees of freedom. As could be expected, (17) is very similar to $p_{\mathrm{plug}}$ in (15). Indeed, it exhibits similarly inappropriate behavior: $p_{\mathrm{post}} \to 2[1 - Y_n(\sqrt{n})]$, a positive constant, as $|\bar{x}_{\mathrm{obs}}|/s_{\mathrm{obs}} \to \infty$. For instance, when n = 4 this constant is .12, and the posterior predictive p value never drops below it, no matter how many standard deviations $\bar{x}_{\mathrm{obs}}$ is from 0. The inadequacy of $p_{\mathrm{post}}$ (or of $p_{\mathrm{plug}}$) here can be traced directly to the double use of the data: $x_{\mathrm{obs}}$ is involved both in computing the posterior (or the MLE) and in computing the tail area. Interestingly, the problem with $p_{\mathrm{plug}}$ is less severe than that with $p_{\mathrm{post}}$, in that its limiting constant is smaller (.046 when n = 4, for instance).
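Both limiting constants are easy to check numerically. The following stdlib-only Python sketch (the Simpson-rule t distribution function is our own helper, not an assumed library routine) evaluates $2[1 - \Phi(\sqrt{n})]$ and $2[1 - Y_n(\sqrt{n})]$ for n = 4:

```python
import math

def norm_cdf(x):
    # standard normal distribution function, via the error function
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def t_sf(x, df, m=2000):
    # upper-tail probability of the t distribution with df degrees of
    # freedom (x >= 0), by Simpson integration of the density on [0, x]
    c = math.gamma((df + 1) / 2) / (math.gamma(df / 2) * math.sqrt(df * math.pi))
    f = lambda u: c * (1 + u * u / df) ** (-(df + 1) / 2)
    h = x / m
    s = f(0) + f(x) + sum((4 if i % 2 else 2) * f(i * h) for i in range(1, m))
    return 0.5 - s * h / 3

n = 4
plug_limit = 2 * (1 - norm_cdf(math.sqrt(n)))  # limit of p_plug, about .046
post_limit = 2 * t_sf(math.sqrt(n), n)         # limit of p_post, about .12
```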
$p_{\mathrm{cpred}}$: Computation shows that

$f(x\mid t; \sigma^2) \propto \frac{f(x;\sigma^2)}{f(t;\sigma^2)} \propto (\sigma^2)^{-(n-1)/2}\exp\left\{-\frac{n s^2}{2\sigma^2}\right\},$

which is maximized at $\hat\sigma^2_{\mathrm{cMLE}} = n s^2/(n-1)$. As observed earlier, it is equivalent to take $S^2$ as the conditioning statistic. It is then easy to show that $\pi(\sigma^2\mid s^2)$ is $\mathrm{Ga}^{-1}\left((n-1)/2,\ n s^2/2\right)$ and that $m(\bar{x}\mid s^2_{\mathrm{obs}}) = t_{n-1}\!\left(\bar{x}\mid 0,\ \frac{1}{n-1}s^2_{\mathrm{obs}}\right)$. The resulting conditional predictive p value is

(18) $p_{\mathrm{cpred}} = 2\left[1 - Y_{n-1}\!\left(\sqrt{n-1}\,|\bar{x}_{\mathrm{obs}}|/s_{\mathrm{obs}}\right)\right].$
This is perfectly satisfactory; indeed, it equals the classical p value based on the usual one-sample t statistic, which is known to be uniform under the null.
$p_{\mathrm{ppost}}$: In this case $T = |\bar{X}|$ is independent of $\hat\sigma^2_{\mathrm{cMLE}} \propto S^2$ (and they are clearly jointly sufficient), so that the partial posterior predictive p value equals the conditional predictive p value, $p_{\mathrm{cpred}}$, in (18).
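The uniformity of (18) under the null is easy to confirm by simulation. In the stdlib-Python sketch below (sample size, seed, and σ are arbitrary choices), the t distribution function is obtained by numerically integrating the t density:

```python
import math, random

def t_sf(x, df, m=400):
    # upper-tail probability of the t distribution (x >= 0), Simpson's rule
    c = math.gamma((df + 1) / 2) / (math.gamma(df / 2) * math.sqrt(df * math.pi))
    f = lambda u: c * (1 + u * u / df) ** (-(df + 1) / 2)
    h = x / m
    s = f(0) + f(x) + sum((4 if i % 2 else 2) * f(i * h) for i in range(1, m))
    return 0.5 - s * h / 3

def p_cpred(xs):
    # equation (18): 2[1 - Y_{n-1}(sqrt(n-1)|xbar|/s)], with s^2 = sum(x-xbar)^2/n
    n = len(xs)
    xbar = sum(xs) / n
    s = math.sqrt(sum((x - xbar) ** 2 for x in xs) / n)
    return 2 * t_sf(math.sqrt(n - 1) * abs(xbar) / s, n - 1)

random.seed(1)
n, reps, sigma = 4, 3000, 3.0   # sigma arbitrary: the p value is pivotal
ps = [p_cpred([random.gauss(0, sigma) for _ in range(n)]) for _ in range(reps)]
mean_p = sum(ps) / reps                       # should be near 1/2
frac10 = sum(p <= 0.10 for p in ps) / reps    # should be near 0.10
```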
It should be noted that Meng (1994) also considered use of the departure statistic $t(x) = |\bar{x}|/s$ in the foregoing example, and with this statistic the posterior predictive p value and the plug-in p value perform fine (being then equal to the other p values). Note, however, that in more complex problems it may be quite difficult to find "appropriate" departure statistics for use with the posterior predictive or plug-in p values (RVV 2000).
2.2 Motivations
2.2.1 Bayesian Motivations for $p_{\mathrm{ppost}}$ and $p_{\mathrm{cpred}}$. The U-conditional predictive p values appear to combine the positive features of both the prior predictive and the posterior predictive p values. First, they are based on the prior predictive m(x), which has a natural Bayesian meaning; indeed, when $\pi(\theta)$ is proper, $m(t\mid u)$ is simply the conditional distribution of T given U arising from the prior predictive m(x). Second, with an appropriate choice of U, (10) can be made to primarily reflect surprise in the model, with the prior playing only a secondary role. Third, noninformative priors can be used, as long as $\pi(\theta\mid u)$ is proper. Finally, there is no double use of the data, because one part of the data ($u_{\mathrm{obs}}$) is used to produce the posterior that eliminates $\theta$, whereas another part ($t_{\mathrm{obs}}$) is used in computing the tail area.
Of course, the key to the U-conditional predictive p value is a suitable choice of the conditioning statistic U. Different possible choices of U have been explored by Bayarri and Berger (1997). (See also Evans 1997, where the conditional predictive distribution was used to develop alternative measures of surprise, with U and T chosen to be separate subsamples of the data. A rather different possibility is the cross-validatory predictive distribution described in Gelfand, Dey, and Chang 1992; see Carlin 1999 and the rejoinder in Bayarri and Berger 1999 for more discussion.) The intuition behind a suitable choice of U is that one wants U to contain as much information about $\theta$ as possible, so that $\pi(\theta\mid u_{\mathrm{obs}})$ will effectively eliminate $\theta$ (via integration), subject to the constraint that U should not involve T, as this could reduce the discriminatory power of the procedure. In Example 1, for instance, $\sum x_i^2/n$ would contain all the information about $\sigma^2$ (being a sufficient statistic under the presumed model), but it involves $t(x) = |\bar{x}|$. The obvious solution (used in Example 1) is to take $u(x) = s^2 = \sum(x_i - \bar{x})^2/n$, because this contains the information about $\sigma^2$ that is independent of $t(X)$.
Investigations by Bayarri and Berger (1997) also suggest that u(x) should have the same dimension as $\theta$. The simplest general algorithm that achieves these various aims, for the case of continuous data, is to define U to be the conditional MLE of $\theta$ given $t(x) = t$, as in (13). [The situation for discrete data is considerably more difficult; whereas $\hat\theta_{\mathrm{cMLE}}$ in (13) is still typically well defined, it will not be suitable as a conditioning statistic if the resulting conditional sample space contains too few values.]
Although logically appealing, the conditional predictive p value, with the conditioning statistic $\hat\theta_{\mathrm{cMLE}}$ chosen as in (13), can be difficult to compute. An attractive alternative is to use $f(x\mid t;\theta)$ [see (13)] directly to integrate out $\theta$, rather than merely using it to define $\hat\theta_{\mathrm{cMLE}}$. This leads to the partial posterior predictive p value, defined in (8) and (9), which is typically much easier to work with. Furthermore, the parallel with (13) suggests that the partial posterior predictive p value will be very similar to the conditional predictive p value. This is borne out in our (continuous) examples and is further reinforced by RVV (2000), who show that $p_{\mathrm{cpred}}$ and $p_{\mathrm{ppost}}$ are asymptotically equivalent.
The foregoing Bayesian motivations may appear rather "loose," but history in other areas of statistics has shown that when sound Bayesian reasoning and noninformative priors are used to develop procedures, these procedures typically also have very desirable non-Bayesian properties. That this is so for $p_{\mathrm{cpred}}$ and $p_{\mathrm{ppost}}$ is discussed in the next section.
2.2.2 Frequentist Motivations and Comparisons. RVV (2000) show that $p_{\mathrm{cpred}}$ and $p_{\mathrm{ppost}}$ are asymptotic frequentist p values; that is, their asymptotic distribution is U[0, 1] for all $\theta$. In this section we study whether this holds for small samples, calling p(X) a frequentist p value if its distribution is U[0, 1] for all $\theta$. We present an illustrative example and two relevant theorems, the first of which follows.
Theorem 1. Let p(X) be any U-conditional predictive p value for a proper $\pi(\theta)$, and consider it as a random variable with respect to the distribution $f(x;\theta)$. Assume that the distribution of p(X) does not depend on $\theta$ and that the conditional distribution of T given U is absolutely continuous. Then p(X) is a frequentist p value for all $\theta$. The conclusion also holds for improper $\pi(\theta)$ under condition (A.1) in the Appendix, which is in particular satisfied if U has a location or scale-parameter distribution and $\pi(\theta)$ is the reference prior.
Proof. Suppose that $\pi(\theta)$ is proper. Then both the conditional predictive distribution of T, $m(t\mid u) = \int f(t\mid u;\theta)\,\pi(\theta\mid u)\,d\theta$, and the prior predictive for U, $m(u) = \int f(u;\theta)\,\pi(\theta)\,d\theta$, are proper. Also, because p(X) is by definition a proper p value with respect to $m(t\mid u)$, it follows that

$\Pr^{m(x)}(p(X) \le \alpha) = E^{m(u)}\left\{\Pr^{m(t\mid u)}(p(X) \le \alpha)\right\} = E^{m(u)}[\alpha] = \alpha.$

But because

$\Pr^{m(x)}(p(X) \le \alpha) = E^{\pi(\theta)}\left\{\Pr^{f(x;\theta)}(p(X) \le \alpha)\right\},$

it follows that, if p(X) has a distribution that does not depend on $\theta$, then

$\Pr^{m(x)}(p(X) \le \alpha) = E^{\pi(\theta)}[c(\alpha)] = c(\alpha),$

where $c(\alpha)$ is some function of $\alpha$. It is immediate that $c(\alpha) = \alpha$, and hence p(X) is an exact p value. The proof for the improper case is given in the Appendix.
An obvious situation in which Theorem 1 applies is when U can be taken to be a sufficient statistic for $\theta$. In that case $m(t\mid u) = f(t\mid u)$, and the U-conditional predictive p value equals the frequentist similar p value. Another application of Theorem 1 is to $p_{\mathrm{cpred}}$ (and $p_{\mathrm{ppost}}$) in Example 1. From (18), it is clear that their distributions do not depend on $\sigma^2$, because the distribution of $\sqrt{n-1}\,\bar{X}/S$ is free of $\sigma^2$. Also, $\hat\sigma^2_{\mathrm{cMLE}}$ has a scale-parameter distribution, so it can be immediately concluded that $p_{\mathrm{cpred}}$ and $p_{\mathrm{ppost}}$ are frequentist p values for all $\sigma^2$.
In Example 1, $p_{\mathrm{plug}}$ and $p_{\mathrm{post}}$ are not frequentist p values, but the extent to which they deviate from uniformity there must be studied numerically. We thus turn to a simpler situation in which exact computations can be performed.
Example 2. Assume that $X_1, X_2, \ldots, X_n$ is a random sample from the exponential($\lambda$) distribution. Let $T = X_{(1)}$ (which could be used to investigate the lower tail of the null distribution), and assume that the usual noninformative prior $\pi(\lambda) = 1/\lambda$ is to be used. In the following, $X_{(1)} < X_{(2)} < \cdots < X_{(n)}$ denote the order statistics of the observations. Also, define $S = \sum_{i=1}^{n} X_i$ and let $s_{\mathrm{obs}}$ be the sum of the observed $x_i$. We derive the different p values and investigate their properties. The following fact, established in the Appendix, is used repeatedly:

(19) $\Pr(T/S \le c) = 1 - (1 - nc)^{n-1}, \qquad 0 \le c \le 1/n.$
$p_{\mathrm{plug}}$: Clearly $\hat\lambda = n/S$ and $T \sim \mathrm{Ex}(n\lambda)$, so that

(20) $p_{\mathrm{plug}} = e^{-n^2 t_{\mathrm{obs}}/s_{\mathrm{obs}}}.$
That this is conditionally unsatisfactory can be seen by letting $n t_{\mathrm{obs}}/s_{\mathrm{obs}} \to 1$ (its largest possible value), in which case the model would clearly be contraindicated, yet $p_{\mathrm{plug}} \to e^{-n}$, a nonzero constant. To investigate whether $p_{\mathrm{plug}}$ is a frequentist p value for all $\lambda$, an easy computation using (20) and (19) yields, for $\alpha > e^{-n}$,

(21) $\Pr(p_{\mathrm{plug}}(X) \le \alpha) = \Pr\left(T/S \ge -\log\alpha/n^2\right) = (1 + \log\alpha/n)^{n-1}.$

Thus $p_{\mathrm{plug}}(X)$ does not have a U[0, 1] distribution and is not a frequentist p value. Figure 1 graphs the density corresponding to (21) when n = 2, to show the substantial deviation from uniformity that can occur. Note, however, that $p_{\mathrm{plug}}(X)$ is an asymptotic frequentist p value. Indeed, for large n, (21) is approximately

$\Pr(p_{\mathrm{plug}}(X) \le \alpha) \approx \alpha\left[1 - \frac{1}{n}\left(\log\alpha + \frac{1}{2}\log^2\alpha\right)\right],$

which does go to $\alpha$ as $n \to \infty$.
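Equation (21) can be checked by simulation; in the following stdlib-Python sketch (λ is an arbitrary choice, since the distribution of $p_{\mathrm{plug}}$ is free of λ), the empirical cdf of $p_{\mathrm{plug}}$ is compared with (21) for n = 2:

```python
import math, random

def p_plug(xs):
    # equation (20): exp(-n^2 t_obs / s_obs), t_obs = min, s_obs = sum
    n = len(xs)
    return math.exp(-n * n * min(xs) / sum(xs))

random.seed(2)
n, reps, lam = 2, 20000, 1.7
ps = [p_plug([random.expovariate(lam) for _ in range(n)]) for _ in range(reps)]

checks = []
for alpha in (0.3, 0.6, 0.9):   # all exceed e^{-n} = e^{-2}, so (21) applies
    emp = sum(p <= alpha for p in ps) / reps
    exact = (1 + math.log(alpha) / n) ** (n - 1)
    checks.append((alpha, emp, exact))
```

For n = 2 the exact cdf at α = .3 is about .40, well above .3, confirming the non-uniformity visible in Figure 1.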
$p_{\mathrm{sim}}$: Because S is sufficient, the conditional distribution of $(X_1, \ldots, X_n)/s$ given $S = s$ is uniform on the simplex $\{w : w_i \ge 0,\ \sum_{i=1}^{n} w_i = 1\}$, and so

$p_{\mathrm{sim}} = \Pr(T > t_{\mathrm{obs}} \mid s_{\mathrm{obs}}) = \Pr\left(W_{(1)} > t_{\mathrm{obs}}/s_{\mathrm{obs}}\right) = \left(1 - n t_{\mathrm{obs}}/s_{\mathrm{obs}}\right)^{n-1},$

where $W_{(1)} = \min(W_1, \ldots, W_n)$ and $(W_1, \ldots, W_n)$ is uniform on that simplex; the last equality is the conditional version of (19). This will be seen to be equal to $p_{\mathrm{ppost}}$.
$p_{\mathrm{prior}}$: The prior predictive p value cannot be computed for this example, because the prior distribution is improper.
$p_{\mathrm{post}}$: The posterior distribution of $\lambda$ is easily seen to be $\mathrm{Ga}(n, s_{\mathrm{obs}})$, and the posterior predictive density of T is $(n^2/s_{\mathrm{obs}})\left[s_{\mathrm{obs}}/(n t + s_{\mathrm{obs}})\right]^{n+1}$. The posterior predictive p value can then be computed as

$p_{\mathrm{post}} = \Pr^{m_{\mathrm{post}}(t\mid x_{\mathrm{obs}})}(T > t_{\mathrm{obs}}) = \left(1 + n t_{\mathrm{obs}}/s_{\mathrm{obs}}\right)^{-n}.$
It can be seen that $p_{\mathrm{post}} \to 2^{-n}$, a nonzero constant, as $n t_{\mathrm{obs}}/s_{\mathrm{obs}} \to 1$, which is not appropriate behavior. Moreover, the distribution of $p_{\mathrm{post}}$ is not U[0, 1]. Indeed, for $\alpha > 2^{-n}$,

(22) $\Pr(p_{\mathrm{post}}(X) \le \alpha) = \Pr\left[T/S \ge \frac{1}{n}\left(\alpha^{-1/n} - 1\right)\right] = \left(2 - \alpha^{-1/n}\right)^{n-1}.$

The corresponding density function is graphed in Figure 1 for n = 2; it is even further from uniformity than the density corresponding to $p_{\mathrm{plug}}$. Again, however, $p_{\mathrm{post}}$ can be shown to be asymptotically U[0, 1].
$p_{\mathrm{ppost}}$: An easy computation shows that

(23) $f(x\mid t; \lambda) \propto \lambda^{n-1}\exp\left\{-\lambda\left(\textstyle\sum x_i - n t\right)\right\},$

so that the partial posterior for $\lambda$ is

$\pi(\lambda\mid x_{\mathrm{obs}}\setminus t_{\mathrm{obs}}) = \frac{\lambda^{n-2}\,e^{-\lambda(s_{\mathrm{obs}} - n t_{\mathrm{obs}})}}{\Gamma(n-1)\,(s_{\mathrm{obs}} - n t_{\mathrm{obs}})^{-(n-1)}}$

and the partial posterior predictive density is

$m(t\mid x_{\mathrm{obs}}\setminus t_{\mathrm{obs}}) = \frac{n(n-1)\,(s_{\mathrm{obs}} - n t_{\mathrm{obs}})^{n-1}}{(n t + s_{\mathrm{obs}} - n t_{\mathrm{obs}})^{n}}.$
The partial posterior predictive p value can then be computed as

$p_{\mathrm{ppost}} = \Pr^{m(t\mid x_{\mathrm{obs}}\setminus t_{\mathrm{obs}})}(T > t_{\mathrm{obs}}) = \left(1 + \frac{n t_{\mathrm{obs}}}{s_{\mathrm{obs}} - n t_{\mathrm{obs}}}\right)^{-(n-1)} = \left(1 - \frac{n t_{\mathrm{obs}}}{s_{\mathrm{obs}}}\right)^{n-1},$

which is identical to the similar p value. Moreover, $p_{\mathrm{ppost}} \to 0$ as $n t_{\mathrm{obs}}/s_{\mathrm{obs}} \to 1$, so that there are no apparent conditional difficulties with the partial posterior predictive p value. It will also be seen that $p_{\mathrm{ppost}}$ is a frequentist p value for all n.
$p_{\mathrm{cpred}}$: Maximization of (23) over $\lambda$ yields $\hat\lambda_{\mathrm{cMLE}} = (n-1)/(S - nT)$, a one-to-one function of $S - nT = \sum_{i=1}^{n} X_i - n X_{(1)}$, which may therefore be taken as the conditioning statistic. Because each $X_{(i)} - X_{(1)}$ is independent of $X_{(1)}$, it follows directly that $\hat\lambda_{\mathrm{cMLE}}$ is independent of T. As discussed in the paragraph preceding Example 1, it follows that $p_{\mathrm{cpred}} = p_{\mathrm{ppost}}$. Note that the derivation of $p_{\mathrm{ppost}}$ was considerably simpler than that of $p_{\mathrm{cpred}}$. Finally, as in the argument leading to (19), it can be shown that $\Pr(p_{\mathrm{ppost}}(X) \le \alpha)$ does not depend on $\lambda$. Also, $\hat\lambda_{\mathrm{cMLE}}$ has a scale-parameter distribution, and so, by Theorem 1, $p_{\mathrm{cpred}}$ (and hence also $p_{\mathrm{ppost}}$ and $p_{\mathrm{sim}}$) is a frequentist p value. Notice that Theorem 1 cannot be applied directly to $p_{\mathrm{ppost}}$.
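A quick simulation makes the contrast in this example concrete: $p_{\mathrm{ppost}}$ (equivalently $p_{\mathrm{sim}}$ and $p_{\mathrm{cpred}}$) is exactly uniform, whereas $p_{\mathrm{post}}$ is severely conservative. (Stdlib Python; n, λ, and the seed are arbitrary choices.)

```python
import random

random.seed(3)
n, reps, lam = 5, 20000, 0.4
pp, po = [], []
for _ in range(reps):
    xs = [random.expovariate(lam) for _ in range(n)]
    r = n * min(xs) / sum(xs)          # n*t_obs/s_obs, lies in (0, 1)
    pp.append((1 - r) ** (n - 1))      # p_ppost = p_sim = p_cpred
    po.append((1 + r) ** (-n))         # p_post
frac_pp = sum(p <= 0.05 for p in pp) / reps   # should be near 0.05
frac_po = sum(p <= 0.05 for p in po) / reps   # far below 0.05: conservative
```

For n = 5 the exact rejection probability of $p_{\mathrm{post}}$ at level .05 is, by (22), $(2 - .05^{-1/5})^4 \approx .001$, so the posterior predictive p value rejects about fifty times too rarely here.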
It is something of a curiosity that in both Examples 1 and 2, $p_{\mathrm{sim}}$ coincides with $p_{\mathrm{cpred}}$ and $p_{\mathrm{ppost}}$, especially because $p_{\mathrm{sim}}$ and $p_{\mathrm{cpred}}$ are determined from distributions on completely different (conditional) spaces. This is useful methodologically for those who wish to use $p_{\mathrm{cpred}}$ or $p_{\mathrm{sim}}$, because it is typically much easier to derive $p_{\mathrm{ppost}}$ than either of the other two p values. The following theorem gives more general conditions under which this equivalence holds. (It is easy to see that Examples 1 and 2 both satisfy the conditions of the theorem.)
Theorem 2. Suppose that $f(x;\theta)$ is a continuous density from the natural scale exponential family and that statistics $T > 0$ and $U > 0$ exist such that $S = T + U$ is sufficient and

$f(t, u; \theta) = k\,\theta^{\alpha}\,t^{\gamma}\,u^{\alpha-\gamma-2}\exp\{-\theta(t + u)\},$

for some constants k, $\gamma > -1$, and $\gamma < \alpha - 1$. Under the usual noninformative prior $\pi(\theta) = 1/\theta$, the p values $p_{\mathrm{cpred}}$, $p_{\mathrm{ppost}}$, and $p_{\mathrm{sim}}$ are all equal.
Proof. That $p_{\mathrm{cpred}}$ and $p_{\mathrm{ppost}}$ are equal follows from direct calculation. To show their equality with $p_{\mathrm{sim}}$, first integrate $f(t, u; \theta)$ with respect to $\pi(\theta) = 1/\theta$, obtaining

$m(t, u) \propto \frac{t^{\gamma}\,u^{\alpha-\gamma-2}}{(t + u)^{\alpha}}.$

Thus $m(t\mid u_{\mathrm{obs}}) = c\,t^{\gamma}(t + u_{\mathrm{obs}})^{-\alpha}$, where c is the appropriate normalizing constant, and hence

(24) $p_{\mathrm{cpred}} = \int_{t_{\mathrm{obs}}}^{\infty} m(t\mid u_{\mathrm{obs}})\,dt = c\int_{t_{\mathrm{obs}}}^{\infty} \frac{t^{\gamma}}{(t + u_{\mathrm{obs}})^{\alpha}}\,dt.$

An easy computation shows that the conditional density of T given $S = s$ is

$f^{*}(t\mid s) = c^{*}\,\frac{t^{\gamma}(s - t)^{\alpha-\gamma-2}}{s^{\alpha}}, \qquad 0 < t < s,$

where $c^{*}$ is the appropriate normalizing constant. Hence $p_{\mathrm{sim}}$ is given by

$p_{\mathrm{sim}} = \int_{t_{\mathrm{obs}}}^{s_{\mathrm{obs}}} f^{*}(t\mid s_{\mathrm{obs}})\,dt = \frac{c^{*}}{s_{\mathrm{obs}}^{\alpha}}\int_{t_{\mathrm{obs}}}^{s_{\mathrm{obs}}} t^{\gamma}(s_{\mathrm{obs}} - t)^{\alpha-\gamma-2}\,dt.$

Changing variables to $t = s_{\mathrm{obs}}w/(w + u_{\mathrm{obs}})$, the latter integral reduces to that in (24), and the theorem follows.
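The change of variables in the proof can also be verified numerically. The stdlib-Python sketch below (the values of γ, α, $t_{\mathrm{obs}}$, and $u_{\mathrm{obs}}$ are made up, subject to γ > −1 and γ < α − 1) computes the two normalized tail integrals by Simpson's rule and confirms that they agree:

```python
import math

# made-up values satisfying gamma > -1 and gamma < alpha - 1
gamma_, alpha_ = 0.5, 4.0
t_obs, u_obs = 0.7, 2.3
s_obs = t_obs + u_obs

def simpson(f, a, b, m=4000):
    # composite Simpson rule (m even)
    h = (b - a) / m
    s = f(a) + f(b)
    for i in range(1, m):
        s += (4 if i % 2 else 2) * f(a + i * h)
    return s * h / 3

# p_cpred: normalized tail of t^gamma (t + u_obs)^(-alpha) on (t_obs, inf);
# substitute t = u_obs*v/(1 - v), which maps (0, inf) onto (0, 1)
def g(v):
    t = u_obs * v / (1 - v)
    return t ** gamma_ * (t + u_obs) ** (-alpha_) * u_obs / (1 - v) ** 2

v_obs = t_obs / (t_obs + u_obs)
p_cpred = simpson(g, v_obs, 1 - 1e-9) / simpson(g, 1e-12, 1 - 1e-9)

# p_sim: normalized tail of t^gamma (s_obs - t)^(alpha-gamma-2) on (t_obs, s_obs)
def h(t):
    return t ** gamma_ * max(s_obs - t, 0.0) ** (alpha_ - gamma_ - 2)

p_sim = simpson(h, t_obs, s_obs) / simpson(h, 1e-12, s_obs)
```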
2.3 Computation
Simulation methods are typically needed to compute the partial posterior predictive p value. These simulations will typically be only modestly more difficult than those involved in computing either the prior predictive p value or the posterior predictive p value, provided that $f(t_{\mathrm{obs}};\theta)$ is available in closed form.
Noting that the partial posterior predictive p value can be rewritten as

$p_{\mathrm{ppost}} = \int \Pr(T \ge t_{\mathrm{obs}};\theta)\,\pi(\theta\mid x_{\mathrm{obs}}\setminus t_{\mathrm{obs}})\,d\theta,$

an obvious strategy is to repeatedly generate $\theta$ from $\pi(\theta\mid x_{\mathrm{obs}}\setminus t_{\mathrm{obs}})$ and then T from $f(t;\theta)$ [which could of course be done by simply generating X from $f(x;\theta)$ and computing t(X)], estimating $p_{\mathrm{ppost}}$ by the fraction of generated T that exceed $t_{\mathrm{obs}}$.
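In Example 2 this strategy can be written down directly, because the partial posterior there is $\mathrm{Ga}(n-1,\ s_{\mathrm{obs}} - n t_{\mathrm{obs}})$ and T is the minimum of n iid exponentials. A minimal stdlib-Python sketch (the observed sample is made up) compares the Monte Carlo estimate with the closed form $(1 - n t_{\mathrm{obs}}/s_{\mathrm{obs}})^{n-1}$:

```python
import random

random.seed(4)
x_obs = [0.8, 1.9, 0.3, 2.4, 1.1]   # made-up observed sample
n = len(x_obs)
t_obs, s_obs = min(x_obs), sum(x_obs)

reps, hits = 20000, 0
for _ in range(reps):
    # draw lambda from the partial posterior Ga(n-1, s_obs - n*t_obs)
    lam = random.gammavariate(n - 1, 1 / (s_obs - n * t_obs))
    # draw T from f(t; lambda): the minimum of n iid Ex(lambda) is Ex(n*lambda)
    t = random.expovariate(n * lam)
    hits += t >= t_obs
p_mc = hits / reps
p_exact = (1 - n * t_obs / s_obs) ** (n - 1)   # closed form from Example 2
```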
There are various possibilities for generating from $\pi(\theta\mid x_{\mathrm{obs}}\setminus t_{\mathrm{obs}})$. If generation from the full posterior $\pi(\theta\mid x_{\mathrm{obs}})$ is easy, then a natural possibility is the following simple Metropolis chain: use $\pi(\theta\mid x_{\mathrm{obs}})$ as the probing distribution to obtain a candidate $\theta^{*}$, and then move from the current $\theta$ to the candidate with probability $\min\left\{1,\ f(t_{\mathrm{obs}};\theta)/f(t_{\mathrm{obs}};\theta^{*})\right\}$.
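For Example 2 this chain is easy to exhibit: the full posterior $\pi(\lambda\mid x_{\mathrm{obs}})$ is $\mathrm{Ga}(n, s_{\mathrm{obs}})$ and $f(t_{\mathrm{obs}};\lambda) = n\lambda e^{-n\lambda t_{\mathrm{obs}}}$. The stdlib-Python sketch below (the observed sample is made up) runs the chain and checks its mean against the known partial posterior $\mathrm{Ga}(n-1,\ s_{\mathrm{obs}} - n t_{\mathrm{obs}})$:

```python
import math, random

random.seed(5)
x_obs = [0.8, 1.9, 0.3, 2.4, 1.1]   # made-up observed sample
n = len(x_obs)
t_obs, s_obs = min(x_obs), sum(x_obs)

def f_t(lam):
    # f(t_obs; lambda): density of T = X_(1), which is Ex(n*lambda)
    return n * lam * math.exp(-n * lam * t_obs)

lam = n / s_obs                     # start the chain at the MLE
draws = []
for _ in range(30000):
    cand = random.gammavariate(n, 1 / s_obs)   # probe with pi(lambda|x_obs)
    if random.random() < min(1.0, f_t(lam) / f_t(cand)):
        lam = cand
    draws.append(lam)

# the chain targets the partial posterior, here Ga(n-1, s_obs - n*t_obs)
mean_mc = sum(draws) / len(draws)
mean_exact = (n - 1) / (s_obs - n * t_obs)
```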
In some sense, a "bad" discrepancy statistic T is one for which $f(t_{\mathrm{obs}};\theta)$ is highly variable in $\theta$. (Casually chosen T for model checking will often have this property.) Whereas such T will not yield good p values by standard methods, the results in this article (and in RVV 2000) indicate that the new conditional p values will still be quite satisfactory. The price to be paid, however, is that computation of the new p values can be more difficult with such T, because $\pi(\theta\mid x_{\mathrm{obs}})$ may no longer be a good probing distribution. A slight modification of the foregoing Metropolis chain can then be considerably more efficient: generate U, a uniform random variable on (0, 1); generate $\theta'$ from $\pi(\theta\mid x_{\mathrm{obs}})$; and choose the candidate $\theta^{*} = \theta' + U(\hat\theta - \hat\theta_{\mathrm{cMLE}})$, where $\hat\theta$ and $\hat\theta_{\mathrm{cMLE}}$ are the MLE and the conditional MLE [see (13)] of $\theta$. Then move from the current $\theta = \theta^{o} + U^{o}(\hat\theta - \hat\theta_{\mathrm{cMLE}})$ to the candidate $\theta^{*}$ with probability

$\min\left\{1,\ \frac{f(t_{\mathrm{obs}};\theta)}{f(t_{\mathrm{obs}};\theta^{*})}\, \frac{\pi(\theta^{*})}{\pi(\theta)}\,\frac{\pi(\theta^{o})}{\pi(\theta')}\, \frac{f(x_{\mathrm{obs}};\theta^{*})}{f(x_{\mathrm{obs}};\theta)}\, \frac{f(x_{\mathrm{obs}};\theta^{o})}{f(x_{\mathrm{obs}};\theta')}\right\}.$
Alternatives to such direct Monte Carlo computation of $p_{\mathrm{ppost}}$ include importance sampling schemes. For instance, if a (possibly dependent) sample $\{\theta_j,\ j = 1, \ldots, m\}$ from $\pi(\theta\mid x_{\mathrm{obs}})$ were available, then one could estimate $p_{\mathrm{ppost}}$ by

$\hat{p}_{\mathrm{ppost}} = \frac{\sum_{j=1}^{m} \Pr(T \ge t_{\mathrm{obs}};\theta_j)\big/f(t_{\mathrm{obs}};\theta_j)}{\sum_{j=1}^{m} 1\big/f(t_{\mathrm{obs}};\theta_j)}.$

[This would work well because $f(x_{\mathrm{obs}};\theta)$ is typically considerably more concentrated than $f(t_{\mathrm{obs}};\theta)$.]
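In Example 2, where the sample $\{\lambda_j\}$ can be drawn directly from the full posterior $\mathrm{Ga}(n, s_{\mathrm{obs}})$ and $\Pr(T \ge t_{\mathrm{obs}};\lambda) = e^{-n\lambda t_{\mathrm{obs}}}$, the estimator can be sketched as follows (stdlib Python; the observed sample is made up):

```python
import math, random

random.seed(6)
x_obs = [0.8, 1.9, 0.3, 2.4, 1.1]   # made-up observed sample
n = len(x_obs)
t_obs, s_obs = min(x_obs), sum(x_obs)

num = den = 0.0
for _ in range(40000):
    lam = random.gammavariate(n, 1 / s_obs)           # draw from pi(lambda|x_obs)
    w = 1.0 / (n * lam * math.exp(-n * lam * t_obs))  # weight 1/f(t_obs; lambda)
    num += w * math.exp(-n * lam * t_obs)             # Pr(T >= t_obs; lambda)
    den += w
p_is = num / den
p_exact = (1 - n * t_obs / s_obs) ** (n - 1)          # closed form for this example
```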
When $f(t_{\mathrm{obs}};\theta)$ is not available in closed form, it must be estimated, possibly through some type of kernel estimate; note that T is typically a one-dimensional statistic, and estimating a one-dimensional density at a point is usually not excessively difficult. Of course, this estimation must be done in conjunction with the Metropolis or importance sampling schemes mentioned earlier, and efficiency might improve if one keeps only widely spaced (i.e., approximately independent) $\theta_j$.
Computing $p_{\mathrm{cpred}(u)}$ is usually considerably more difficult, unless the densities on the left or right side of (11) are available in closed form. Various Gibbs and Metropolis-Hastings schemes for its computation were given by Bayarri and Berger (1999) and are not repeated here. The computational difficulty of $p_{\mathrm{cpred}(u)}$ (see also Pauler 1999) is the main reason we recommend $p_{\mathrm{ppost}}$ for routine use.
3. The Normal Linear Model
In this section we derive the various p values for the normal linear model and give characterizations of their degree of uniformity (in the frequentist sense). The section was motivated by RVV (2000), who derived corresponding results under assumptions that yield asymptotic normality; seeing the results in the finite-sample setting (under normality) should help alleviate any concerns about "asymptopia." Note that in this section we do not attempt to distinguish between the MLE and its value at the observed data; both are denoted by $\hat\theta$.
Let $Y = (Y_1, Y_2, \ldots, Y_n)^t$ be the $n \times 1$ vector of response variables, let $\theta = (\theta_1, \theta_2, \ldots, \theta_k)^t$ be the $k \times 1$ vector of regression coefficients, let V be a full-rank $n \times k$ matrix of covariables, and let $\epsilon$ be an $n \times 1$ vector of errors. Assume that we are testing

(25) $H_0:\ Y = V\theta + \epsilon, \qquad \epsilon \sim N_n(0, \sigma^2 I),\ \ \sigma^2\ \text{known}.$

Consider a linear departure statistic $T = w^t Y$, with given $w = (w_1, w_2, \ldots, w_n)^t$. It follows from (25) that

(26) $T\mid\theta \sim N\left(w^t V\theta,\ \sigma^2\|w\|^2\right).$

Also, with the usual noninformative prior $\pi(\theta) = 1$, the posterior distribution $\pi(\theta\mid y)$ is $N_k\left(\theta\mid\hat\theta,\ \sigma^2(V^tV)^{-1}\right)$, where $\hat\theta = (V^tV)^{-1}V^t y$ is the usual least squares estimate.
3.1 Plug-In p Value
It follows from (26) that $p_{\mathrm{plug}}$ is given by

$p_{\mathrm{plug}} = \Pr^{f(t;\hat\theta)}(T > t_{\mathrm{obs}}) = 1 - \Phi\!\left(\frac{t_{\mathrm{obs}} - w^t V\hat\theta}{\sigma\|w\|}\right).$
To study the distribution of $p_{\mathrm{plug}}(Y)$ (to assess its frequentist uniformity), note that

(27) $T - w^t V\hat\theta \sim N\left(w^t B V\theta,\ \sigma^2 w^t B B^t w\right),$

where $B = I - V(V^tV)^{-1}V^t$. Because $BV = 0$ and $BB^t = B$, it follows that

(28) $p_{\mathrm{plug}}(Y) = 1 - \Phi\!\left(\sqrt{w^t B w/\|w\|^2}\,Z\right),$

where $Z \sim N(0, 1)$. Thus $p_{\mathrm{plug}}(Y)$ will have a U[0, 1] distribution only if $w^t B w/\|w\|^2 = 1$, which in turn can happen only if $V^t w = 0$. Although the latter will be satisfied by common choices of T, such as a linear function of the vector of residuals, it clearly need not hold in general. When it does not hold, $w^t B w/\|w\|^2$ will be smaller than 1, so that $p_{\mathrm{plug}}$ will be conservative.
3.2 Posterior Predictive p Value
The posterior predictive distribution of T given $y_{\mathrm{obs}}$ is $N(w^t V\hat\theta,\ \sigma^2 w^t C w)$, where $C = I + V(V^tV)^{-1}V^t$. It follows that the posterior predictive p value is given by

$p_{\mathrm{post}} = \Pr^{m_{\mathrm{post}}(t\mid y_{\mathrm{obs}})}(T > t_{\mathrm{obs}}) = 1 - \Phi\!\left(\frac{t_{\mathrm{obs}} - w^t V\hat\theta}{\sigma\sqrt{w^t C w}}\right).$

When considered as a random p value and using (27), $p_{\mathrm{post}}$ can be expressed as

(29) $p_{\mathrm{post}}(Y) = 1 - \Phi\!\left(\sqrt{w^t B w/w^t C w}\,Z\right).$

Again, this will be U[0, 1] only if $V^t w = 0$. Otherwise, $w^t C w$ will be larger than $\|w\|^2$ and, comparing (28) and (29), $p_{\mathrm{post}}$ will then be even more conservative than $p_{\mathrm{plug}}$. This observation was first made in the asymptotic setting by Robins (1999) and RVV (2000).
3.3 Partial Posterior Predictive p Value
Calculation yields

(30) $f(y\mid t_{\mathrm{obs}};\theta) \propto \exp\left\{-\frac{1}{2\sigma^2}\left[(\theta - \hat\theta)^t V^t V(\theta - \hat\theta) - \left(w^t V\theta - t_{\mathrm{obs}}\right)^t\left(\|w\|^2\right)^{-1}\left(w^t V\theta - t_{\mathrm{obs}}\right)\right]\right\}.$
Because the partial posterior distribution $\pi(\theta\mid y_{\mathrm{obs}}\setminus t_{\mathrm{obs}})$ is proportional to (30), expanding the quadratic forms in (30) and rearranging terms yields

(31) $\pi(\theta\mid y_{\mathrm{obs}}\setminus t_{\mathrm{obs}}) = N_k\left(\theta\mid u_{\mathrm{obs}},\ \sigma^2\Sigma\right),$

where $U = (V^t H V)^{-1}V^t H Y$, $\Sigma = (V^t H V)^{-1}$, $H = I - w w^t/\|w\|^2$, and the right side of (31) denotes the k-variate normal density in $\theta$ with the given mean and covariance matrix. From (26) and (31), it follows that the partial posterior predictive distribution of T is given by

$T\mid y_{\mathrm{obs}}\setminus t_{\mathrm{obs}} \sim N\left(w^t V u_{\mathrm{obs}},\ \sigma^2 w^t\left[I + V\Sigma V^t\right]w\right),$
so that the partial posterior predictive p value is

$p_{\mathrm{ppost}} = \Pr^{m(t\mid y_{\mathrm{obs}}\setminus t_{\mathrm{obs}})}(T > t_{\mathrm{obs}}) = 1 - \Phi\!\left(\frac{t_{\mathrm{obs}} - w^t V u_{\mathrm{obs}}}{\sigma\sqrt{w^t\left[I + V\Sigma V^t\right]w}}\right).$
To study the distribution of $p_{\mathrm{ppost}}(Y)$, note that

$T - w^t V U\mid\theta \sim N\left(w^t D V\theta,\ \sigma^2 w^t D D^t w\right),$

where $D = I - V(V^t H V)^{-1}V^t H$ and H is as in (31). Algebra shows that $w^t D V = 0$ and $w^t D D^t w = w^t\left[I + V\Sigma V^t\right]w$, so that $p_{\mathrm{ppost}}(Y) = 1 - \Phi(Z)$, where $Z \sim N(0, 1)$. Thus $p_{\mathrm{ppost}}$ is a valid frequentist p value.
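The contrast among (28), (29), and the exact uniformity of $p_{\mathrm{ppost}}$ is easy to see by simulation in the simplest special case: V a column of ones (k = 1), σ = 1, and $w = e_1$, so that $V^t w \ne 0$. Here $w^t V\hat\theta = \bar{y}$ and $u_{\mathrm{obs}}$ reduces to the mean of $y_2, \ldots, y_n$, giving the specializations used in the stdlib-Python sketch below (all numerical choices are arbitrary):

```python
import math, random

def phi(x):
    # standard normal distribution function
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

# H0: Y_i = theta + eps_i with sigma = 1 known; departure statistic T = Y_1
random.seed(7)
n, reps, theta = 5, 20000, 2.5          # theta arbitrary under H0
hit_plug = hit_post = hit_ppost = 0
for _ in range(reps):
    y = [theta + random.gauss(0, 1) for _ in range(n)]
    ybar = sum(y) / n                    # w^t V theta-hat reduces to ybar
    u = sum(y[1:]) / (n - 1)             # u_obs: mean of y_2, ..., y_n
    hit_plug += (1 - phi(y[0] - ybar)) <= 0.05
    hit_post += (1 - phi((y[0] - ybar) / math.sqrt(1 + 1 / n))) <= 0.05
    hit_ppost += (1 - phi((y[0] - u) / math.sqrt(1 + 1 / (n - 1)))) <= 0.05
f_plug, f_post, f_ppost = hit_plug / reps, hit_post / reps, hit_ppost / reps
```

At nominal level .05, $p_{\mathrm{ppost}}$ rejects at close to .05, whereas $p_{\mathrm{plug}}$ rejects too rarely and $p_{\mathrm{post}}$ even more rarely, in line with (28) and (29).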
3.4 Conditional Predictive p Value
It is easily seen from (30) and (31) that the conditional MLE obtained by maximizing (30) over $\theta$ is precisely the statistic U given in (31); that is, $U = (V^t H V)^{-1}V^t H Y$. Because T and U have a joint (k + 1)-variate normal distribution with $\mathrm{cov}(T, U) = \sigma^2 w^t H V(V^t H V)^{-1} = 0$, they are independent. It follows that $p_{\mathrm{cpred}}$ equals $p_{\mathrm{ppost}}$, and hence it is also a valid frequentist p value.
4. Discrete Data: 2 × 2 Contingency Tables
For discrete sample spaces, the most common classical approach to defining p values is to condition on a statistic U such that $f(x\mid u;\theta)$ does not depend on $\theta$; the Fisher exact test is the prototypical example considered here. Note that conditioning on any U can severely constrain the sample space, resulting in serious conservatism of the resulting p value (because there may then be very few possible observations in the "tail" of the departure statistic T). We will see that $p_{\mathrm{ppost}}$ can substantially overcome this difficulty. [We do not consider $p_{\mathrm{cpred}}$, because the choice of conditioning statistic in (13) typically does not work in discrete problems, and because any conditional p value can fall prey to the same difficulty just mentioned.]
We specifically consider the problem of testing homogeneity and independence in 2 × 2 contingency tables, comparing the similar p value (which is the Fisher exact test) and the partial posterior predictive p value. Many other p values for contingency tables have been proposed (a nice survey was given by Agresti 1992; see also Hwang and Yang 1997), and many of these perform considerably better than the Fisher exact test. Our attitude here is not that of seeking an optimal p value for these situations, but rather of seeing whether straightforward implementation of $p_{\mathrm{ppost}}$ can offer significant gains. (Recall that we hope to see $p_{\mathrm{ppost}}$ used in situations of considerable complexity, in which there is little hope of determining optimal p values; in judging its effectiveness, however, it is useful to consider moderately difficult situations, such as this one, to see whether an easy implementation works.)
Consider the following 2 × 2 contingency table:

             A_1      A_2      Totals
    B_1      X_11     X_12     X_1+
    B_2      X_21     X_22     X_2+
    Totals   X_+1     X_+2     n
We analyze two common scenarios involving such tables.
Case 1. One of the margins, say X[sub +1] = n[sub 1], X[sub +2] = n[sub 2], is fixed by the design, so that X[sub 11] and X[sub 12] can be viewed as independent binomial random variables. We want to study the null model of homogeneity, that the two binomial distributions have the same success probability, theta.
Case 2. The design fixes only the overall sample size, n. A common null model is that classification by A and B is independent, so that the probability of each cell is the product of the corresponding marginal probabilities.
For convenience of notation in this section, we denote an observed value of X[sub ij] by a superscript "o"; that is, x[sup o, sub ij].
4.1 Case 1: Test of Homogeneity
Here the null model is
(32) f(x; \theta) = \binom{n_1}{x_{11}} \binom{n_2}{x_{12}} \theta^{x_{11} + x_{12}} (1 - \theta)^{n_1 + n_2 - x_{11} - x_{12}}.
The Fisher exact test conditions on the other marginal total, say X[sub 1+]. It is common in textbooks to then take the test statistic (in the conditional problem) to be T = X[sub 11]. An easy computation shows that the (one-tailed) p value corresponding to the Fisher exact test (p[sub fet]) is given by
p_{fet} = \sum_{j = x^o_{11}}^{\min(n_1,\, x^o_{1+})} \binom{n_1}{j} \binom{n_2}{x^o_{1+} - j} \bigg/ \binom{n}{x^o_{1+}}.
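This tail sum is easy to compute directly; the following is a minimal Python sketch using only the standard library (the function name p_fet is ours, not the article's):

```python
from math import comb

def p_fet(x11, x12, n1, n2):
    """One-tailed Fisher exact p value for homogeneity: condition on the
    margin x1+ = x11 + x12 and sum the hypergeometric tail Pr(X11 >= x11)."""
    m = x11 + x12          # conditioned margin x_{1+}
    n = n1 + n2
    tail = sum(comb(n1, j) * comb(n2, m - j)
               for j in range(x11, min(n1, m) + 1))
    return tail / comb(n, m)
```

For instance, with n1 = n2 = 3 and the observed table (x11, x12) = (3, 0), this gives 1/20 = .05.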
Unconditionally, T = X[sub 11] is not a particularly sensible statistic for measuring departure from homogeneity. Indeed, Suissa and Shuster (1985) proposed a sensible unconditional T for this particular problem. In illustrating p[sub ppost], however, we first study what happens if one naively "follows the textbooks" and chooses T = X[sub 11], even though this is not sensible unconditionally. (Our point is to show that p[sub ppost] behaves admirably even with a simple, but rather inappropriate, choice of T.) Then we consider a choice of T that is more reasonable from an unconditional perspective.
Choosing the constant prior pi(theta) = 1, an easy computation shows that the partial posterior distribution is beta(x[sup o, sub 12] + 1, n[sub 2] - x[sup o, sub 12] + 1), so that T = X[sub 11] is beta-binomial under the partial posterior predictive and
p_{ppost} = \sum_{t = x^o_{11}}^{n_1} \binom{n_1}{t} \frac{B(t + x^o_{12} + 1,\; n_1 - t + n_2 - x^o_{12} + 1)}{B(x^o_{12} + 1,\; n_2 - x^o_{12} + 1)}.
Incidentally, it can be shown in this problem (for the given choice of T) that p[sub cpred] = p[sub ppost]; this is thus a discrete situation in which conditioning as in (13) does not unduly restrict the sample space.
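Since the source states that the partial posterior is Beta(x[sup o, sub 12] + 1, n[sub 2] - x[sup o, sub 12] + 1), the resulting p value is a beta-binomial tail probability, which can be evaluated directly. A small Python sketch (function names are ours), using log-gamma for numerically stable beta functions:

```python
from math import comb, exp, lgamma

def log_beta(a, b):
    # log B(a, b) computed via log-gamma for numerical stability
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def p_ppost_homogeneity(t_obs, x12, n1, n2):
    """Partial posterior predictive p value for T = X11 in the test of
    homogeneity: T is beta-binomial, i.e. Bin(n1, theta) mixed over the
    partial posterior Beta(x12 + 1, n2 - x12 + 1)."""
    a, b = x12 + 1, n2 - x12 + 1
    lB = log_beta(a, b)
    return sum(comb(n1, t) * exp(log_beta(t + a, n1 - t + b) - lB)
               for t in range(t_obs, n1 + 1))
```

For the most extreme table in Example 3 (t_obs = 3, x12 = 0, n1 = n2 = 3), this gives 1/35, already below the smallest value (.05) attainable by p_fet.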
A more sensible unconditional choice of the discrepancy statistic is T = |(1/n[sub 1])X[sub 11] - (1/n[sub 2])X[sub 12]| (because the null model is that the two binomial populations have the same success probability). The partial posterior predictive p value for this choice does not admit a simple closed-form expression but can be readily computed numerically.
Example 3. As a rather extreme test case, consider n[sub 1] = n[sub 2] = 3. Here conditioning on x[sub 1+] severely restricts the support of the distribution of p[sub fet], which reduces to {.05, .2, .5, .8, .95, 1}. The supports of the distributions of p[sub ppost], for either choice of T, are considerably richer and include more values closer to 0 and 1.
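The claimed support of p[sub fet] can be verified by enumerating all 16 possible tables with exact rational arithmetic; a short Python check (names ours):

```python
from fractions import Fraction
from math import comb

def p_fet_exact(x11, x12, n1=3, n2=3):
    """One-tailed Fisher exact p value as an exact rational number."""
    m, n = x11 + x12, n1 + n2
    tail = sum(comb(n1, j) * comb(n2, m - j)
               for j in range(x11, min(n1, m) + 1))
    return Fraction(tail, comb(n, m))

# Enumerate all 16 possible outcomes (X11, X12) when n1 = n2 = 3.
support = {p_fet_exact(x11, x12) for x11 in range(4) for x12 in range(4)}
```

The resulting support is exactly {1/20, 1/5, 1/2, 4/5, 19/20, 1}, that is, {.05, .2, .5, .8, .95, 1} as stated.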
Figure 2 gives the distribution functions of p[sub fet](X) and p[sub ppost](X) (for both choices of T) at two different values of theta. Recall that the goal is to have p values with close to uniform distributions, and p[sub ppost] clearly fares much better in this regard (the straight dotted lines being the unattainable uniform ideal). As expected, the Fisher exact test is very conservative, which translates into a severe lack of discriminatory power.
Note that p[sub ppost] seems to perform somewhat better with the "sensible" discrepancy statistic T = |(1/n[sub 1])X[sub 11] - (1/n[sub 2])X[sub 12]| than with T = X[sub 11], in the sense that it then has a distribution somewhat closer to uniform in the most interesting region of small values of p. However, p[sub ppost] seems to be quite satisfactory (and much better than the Fisher exact test), even when the intuitively unsuitable T = X[sub 11] is used.
Various other theta were also considered. The distribution of p[sub ppost] for the "sensible" choice of T is remarkably stable and performs very well for all values of theta. The other two p values (p[sub fet] and p[sub ppost] with T = X[sub 11]) were excessively conservative for small theta, although p[sub ppost] began to perform noticeably better even for values of theta as small as .2. For large values of theta, p[sub fet] was again very conservative, whereas p[sub ppost] performed remarkably well, unless theta was very large; we discuss this latter situation more fully later.
4.2 Case 2: Test of Independence
Referring to the contingency table with fixed n and defining theta = Pr(A[sub 1]) and xi = Pr(B[sub 1]), the null model under independence of classification can be expressed as
f(x; theta, xi) = (n!/x[sub 11]!x[sub 12]!x[sub 21]!x[sub 22]!)theta[sup x[sub +1]] (1 - theta)[sup x[sub +2]]xi[sup x[sub 1+]](1 - xi)[sup x[sub 2+]].
Here the Fisher exact test conditions on both margins, and again the "textbook" conditional departure statistic is typically chosen to be T = X[sub 11]. The ensuing conditional density of T is
f(t \mid x^o_{1+}, x^o_{+1}) = \binom{x^o_{1+}}{t} \binom{x^o_{2+}}{x^o_{+1} - t} \bigg/ \binom{n}{x^o_{+1}},
which produces the same p value as in the test for homogeneity. (Note that here, x[sup o, sub 1+] plays the role of n[sub 1] in Case 1.)
In deriving p[sub ppost], we restrict attention to the statistic T = X[sub 11], even though this is not particularly sensible from an unconditional perspective. We do this in part so that it cannot be argued that we obtain better results than p[sub fet] by choice of a better T and, in part, to indicate the quality of p[sub ppost] even with an inferior choice of T.
Using uniform independent priors for theta and xi, the partial posterior, pi(theta, xi|x[sub obs] \ t[sub obs]), is proportional to f(x|t[sub obs]; theta, xi), and p[sub ppost] can most conveniently be expressed as
(33) p_{ppost} = \int_0^1 \!\! \int_0^1 \Pr(T \ge t_{obs} \mid \theta, \xi)\, \pi(\theta, \xi \mid x_{obs} \setminus t_{obs})\, d\theta\, d\xi,
where T ~ Bin(n, theta xi) under the null model and
(34) \pi(\theta, \xi \mid x_{obs} \setminus t_{obs}) \propto \theta^{x^o_{21}} (1 - \theta)^{x^o_{+2}} \xi^{x^o_{12}} (1 - \xi)^{x^o_{2+}} (1 - \theta \xi)^{-(n - t_{obs})}.
(Note that for this case, p[sub ppost] and p[sub cpred] differ, and indeed the latter can be problematical because of possibly restrictive conditioning.)
To compute p[sub ppost], we use importance sampling based on the importance function
(35) 1/2 U(theta|0, 1)beta(xi|x[sup o, sub 12] + 1, x[sup o, sub 22] + 1)
+ 1/2 beta(theta|x[sup o, sub 21] + 1, x[sup o, sub 22] + 1)U(xi|0, 1).
Not only is this an easy importance function to use in terms of random variable generation, but it also is highly efficient computationally, for even very large n. The reasons for this are given in the Appendix, which also presents other computational details.
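A sketch of this importance-sampling computation in Python (function and variable names are ours; the number of draws L and the seed are arbitrary choices), drawing (theta, xi) from the mixture (35) and weighting by the unnormalized partial posterior (34):

```python
import random
from math import comb, exp, lgamma, log

def beta_pdf(v, a, b):
    # density of Beta(a, b) at v in (0, 1)
    log_norm = lgamma(a + b) - lgamma(a) - lgamma(b)
    return exp(log_norm + (a - 1) * log(v) + (b - 1) * log(1 - v))

def binom_tail(t_obs, n, p):
    # Pr(T >= t_obs) for T ~ Bin(n, p), i.e. 1 - B(t_obs - 1; n, p)
    return sum(comb(n, t) * p**t * (1 - p)**(n - t) for t in range(t_obs, n + 1))

def p_ppost_independence(x11, x12, x21, x22, L=20000, seed=1):
    """Importance-sampling estimate of (33), drawing from the mixture (35)
    and weighting by the unnormalized partial posterior (34)."""
    rng = random.Random(seed)
    n, t_obs = x11 + x12 + x21 + x22, x11

    def pi_unnorm(th, xi):       # unnormalized partial posterior (34)
        return (th**x21 * (1 - th)**(x12 + x22) * xi**x12
                * (1 - xi)**(x21 + x22) * (1 - th * xi)**(-(n - t_obs)))

    def h(th, xi):               # mixture importance function (35)
        return 0.5 * beta_pdf(xi, x12 + 1, x22 + 1) \
             + 0.5 * beta_pdf(th, x21 + 1, x22 + 1)

    num = den = 0.0
    for _ in range(L):
        if rng.random() < 0.5:   # component 1: theta ~ U(0,1), xi ~ Beta
            th, xi = rng.random(), rng.betavariate(x12 + 1, x22 + 1)
        else:                    # component 2: theta ~ Beta, xi ~ U(0,1)
            th, xi = rng.betavariate(x21 + 1, x22 + 1), rng.random()
        # guard against draws exactly on the boundary
        th = min(max(th, 1e-12), 1 - 1e-12)
        xi = min(max(xi, 1e-12), 1 - 1e-12)
        w = pi_unnorm(th, xi) / h(th, xi)
        num += w * binom_tail(t_obs, n, th * xi)
        den += w
    return num / den
```

Because the weights appear in both numerator and denominator, the unknown normalizing constant of (34) cancels.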
Example 4. We again consider a rather extreme case, namely n = 5. It can be shown that the support of p[sub fet](X) is limited to {.1, .2, .3, .4, .6, .7, .8, .9}, whereas the support of p[sub ppost](X) is noticeably richer. The distribution functions of these two p values are given in Figure 3 for various values of the parameters. For all but very large values of the parameters, p[sub ppost] seems considerably more uniform than p[sub fet].
Further investigations revealed that when both theta and xi are small, either p value is quite conservative, with p[sub ppost] the less conservative. Both p values perform at their best for theta ≈ xi ≈ .5, with p[sub ppost] performing much better. When one of theta or xi is small and the other is large, p[sub fet] is again very conservative, whereas p[sub ppost] performs remarkably well.
Both p[sub fet] and p[sub ppost] are probably asymptotic frequentist p values, and it is of interest to ask how large n must be for their distributions to be approximately uniform. This is especially interesting for smaller values of p, which are typically of most interest when n is large. One illustrative benchmark is the sample size needed for the distribution function of a p value at .05 to be within 20% of .05 (the value under the desired uniformity). When (theta, xi) = (.6, .5), this obtains for p[sub fet] only when n ≈ 500; in contrast, this occurs for p[sub ppost] when n ≈ 10. When (theta, xi) = (.3, .9), p[sub fet] requires n ≈ 1200, whereas p[sub ppost] needs only n ≈ 110.
The apparent breakdown of both p[sub fet] and p[sub ppost] for large values of (theta, xi) [such as (.9, .9) in Fig. 3] deserves special discussion. First, note that p[sub fet] becomes almost hopelessly conservative, never stating that the data are incompatible with the model. In contrast, p[sub ppost] is markedly anticonservative for this situation. At a very intuitive level, the behavior of p[sub ppost] seems more sensible. After all, we declared large values of T to be evidence against the null model, and when (theta, xi) are both large, the values of T = X[sub 11] clearly will typically be very large; p[sub ppost] reacts to this with ready "rejection" of the null model, whereas p[sub fet] ignores all but incredibly large T. This anticonservative behavior of p[sub ppost] arises because a very large value of T = X[sub 11] contains a great deal of information about the parameters, but relatively little information about deviance from the model. This is one negative consequence of using an inferior choice of T.
The most extreme example of an inappropriate choice of T is a sufficient statistic for the parameter; such a statistic is nearly useless for model checking. We examine this further in a very simple example, so as to better understand the nature of p[sub ppost] in such a situation.
Example 5. Assume that the null model is X[sub i] ~ Bernoulli(theta), i = 1, ..., n, and that T = SIGMA[sup n, sub i=1] X[sub i], a sufficient statistic. Here m(t|u) = m(t|x[sub obs] \ t[sub obs]) = m(t) = 1/(n + 1) for t = 0, ..., n, and p[sub ppost] = 1 - t[sub obs]/(n + 1). For large n, the distribution of p[sub ppost] is approximately N(1 - theta, theta(1 - theta)/n), which concentrates tightly around 1 - theta. Thus when theta is large, the distribution function of p[sub ppost] jumps immediately, giving rise to anticonservative p values; in contrast, for small theta, the situation reverses, and p[sub ppost] is conservative. Figure 4 shows the resulting distribution functions for three values of theta when n = 100.
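The concentration of p[sub ppost] around 1 - theta is easy to see in simulation; a minimal Python sketch (function names are ours):

```python
import random

def p_ppost_bernoulli(t_obs, n):
    """Example 5: T sufficient, m(t) uniform on {0, ..., n}, so
    p_ppost = Pr(T >= t_obs) = 1 - t_obs / (n + 1)."""
    return 1 - t_obs / (n + 1)

def mean_p_ppost(theta, n=100, reps=5000, seed=7):
    """Average of p_ppost over repeated samples from Bernoulli(theta)^n."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        t = sum(rng.random() < theta for _ in range(n))
        total += p_ppost_bernoulli(t, n)
    return total / reps
```

For n = 100 the sampling distribution of p[sub ppost] centers near 1 - n theta/(n + 1): about .11 for theta = .9 and about .80 for theta = .2, reproducing the anticonservative/conservative pattern of Figure 4.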
Of course, this behavior of p[sub ppost] is entirely natural according to Bayesian intuition; large values of T are essentially equivalent to large values of theta, and as such are declared to be "surprising." As another argument, note that p[sub ppost] is equivalent to p[sub prior] here, and choosing T to be sufficient is effectively stating that we will also allow its presence in the tail of the prior to discredit the model.
In contrast, the distributions of both p[sub plug] and p[sub post] can be seen to concentrate tightly about 1/2 when n is large, for any value of theta. (That p[sub post], when it differs from uniformity, does so by concentrating closer to 1/2 was discussed in Meng 1994; see also Rubin 1996.) This is illustrated in Figure 4 for p[sub plug] when n = 100. Thus, unlike p[sub ppost] which provides some kind of information, p[sub plug] and p[sub post] provide completely useless answers here. (Of course, non-Bayesians may argue that it is better to infer nothing than to in effect base a conclusion on the prior; but recall that in the context we are considering, this means essentially refusing to consider alternatives to the null model, at least when T is chosen poorly.)
As a final comment concerning this issue, recall that requiring uniformity of p values for all values of theta might well be too restrictive for a Bayesian (and also possibly for a frequentist). The natural (Bayesian) requirement (see Meng 1994) is to require a p value to be uniform under the prior predictive distribution. Because in this example, the partial posterior predictive reduces to the prior predictive, it follows that p[sub ppost] is indeed a p value for a Bayesian. (The "average" of all the distribution functions of p[sub ppost] in Fig. 4 is uniform.) In contrast, no Bayesian averages of the distribution functions of p[sub plug] (or p[sub post]) can be uniform. Of course, the Bayesian reasoning in this example is facilitated by the fact that the noninformative prior is actually proper, but related arguments involving averages with respect to classes of priors probably can be made for the improper case; we do not pursue this here.
When considering the choice of T in discrete situations, this issue of sufficiency can arise in subtle ways. In Example 3, for instance, consider T = |(1/n[sub 1])X[sub 11] - (1/n[sub 2])X[sub 12]| when n[sub 1] and n[sub 2] are different prime numbers. From a given nonzero value of T, it is then clear that X[sub 11] and X[sub 12] can be completely reconstructed. In this situation, T thus is nearly sufficient, and its use would encounter the aforementioned difficulties. (In contrast, when n[sub 1] = n[sub 2], this problem with T does not arise.) Such "technical" near-sufficiency of T in discrete settings can be eliminated by the simple device of replacing T by a binned version, T[sup *], with the bin size chosen so that each value of T[sup *] corresponds to several sample points.
Our comparisons have not included p[sub prior], because this cannot typically be used with noninformative priors. Also, p[sub sim] is just a version of a conditional predictive p value, obtained by choosing the conditioning statistic U to be a sufficient statistic (when available). Indeed, in all of our continuous examples it happened that p[sub sim] was equal to p[sub cpred], although this certainly is not true in general. For those wishing to use p[sub sim], this equality is a fortunate occurrence when it obtains (see Theorem 3), because p[sub cpred] is typically much easier to compute directly in those situations than p[sub sim]. The following discussion is thus limited to the other four p values.
A surprising observation in our examples (first discussed in Robins 1999) is that p[sub plug] seems superior to p[sub post], in the sense that it is closer to being a frequentist p value; in particular, it is less conservative. This would seem to contradict the common Bayesian intuition that it is better to account for parameter uncertainty by using a posterior than by simply replacing theta by its estimate theta-hat. The explanation is that p[sub post] does not account for parameter uncertainty in a legitimate Bayesian way, because it involves a double use of the data. (Indeed, the original motivation for p[sub cpred] and p[sub ppost] was precisely to account for parameter uncertainty in a legitimate Bayesian fashion.) We have considered only a few situations here, but, together with the similar asymptotic conclusion of RVV (2000), it would seem that p[sub plug] should be preferred to p[sub post] in practice. This is especially so because p[sub plug] is typically easier to compute than p[sub post]. (In some situations in which Bayesian analysis is being performed via MCMC, a posterior sample of theta's might be more readily available than an MLE, but one could then plug in, say, a posterior mean for theta rather than the MLE.) At the very least, our observations here and those of RVV (2000) indicate that it cannot simply be assumed that p[sub post] is better than p[sub plug], as has typically been the case in the literature. It should be noted, however, that posterior predictive p values are also commonly used today with discrepancy statistics that depend on theta, as well as on x, and there are currently no alternatives to their use in such situations (although see RVV 2000).
In all our continuous examples, p[sub plug] performed worse in the frequentist sense than either p[sub ppost] or p[sub cpred]. This again supports the asymptotic conclusions of RVV (2000) and suggests that the latter p values, if available, are to be preferred in practice. Computation is clearly an issue, however, in that p[sub plug] is typically easier to compute than the new p values, especially p[sub cpred] (see also Pauler 1999). Computation of p[sub ppost] is usually not difficult if f(t; theta) is available in closed form, and we would definitely recommend its use in that case.
The (asymptotic) superiority of p[sub ppost] and p[sub cpred] arises when the departure statistic T is not appropriately "centered," as discussed by RVV (2000). In a sense, the new p values can be viewed as automatically "centering" a departure statistic T, which can be a considerable simplification in practice, avoiding the need for asymptotics or clever statistical intuition. Indeed, in model checking one often wishes to try a series of rather generic possible discrepancy statistics T, and having an automatic centering mechanism is a considerable simplification.
On a more speculative note, it is quite plausible that use of p[sub ppost] and p[sub cpred] can result in an improvement (over, say, p[sub plug]) with even "centered" choices of T (as long as the distribution of T still depends on theta to some extent). This could be improvement in finite-sample performance or in higher-order asymptotic terms.
The situation involving discrete distributions is more complex, but the gains through use of the new p values, especially p[sub ppost], can be quite dramatic. Discreteness of the sample space can cause common p values, such as those from the Fisher exact test, to be very conservative in small samples, whereas the partial posterior p value is rather remarkably uniform, especially if a reasonable discrepancy statistic T is used.
GRAPH: Figure 1. Densities of p[sub plug](X)(---) and p[sub post](X)(----) in Example 2, when n = 2. The uniform density is plotted for reference.
GRAPHS: Figure 2. Distributions of p[sub fet](X) [(a) and (b)]; p[sub ppost](X) with T = X[sub 11] [(c) and (d)] and T = X[sub 11] - X[sub 12] [(e) and (f)] for theta = .4 [(a), (c), and (e)] and theta = .8 [(b), (d), and (f)] in Example 3.
GRAPHS: Figure 3. Distributions of p[sub fet](X) [(a), (c), and (e)] and p[sub ppost](X) [(b), (d), and (f)] in Example 4 for (theta; xi) = (.6, .5) [(a) and (b)] (theta; xi) = (.3, .9) [(c) and (d)] and (theta; xi) = (.9, .9) [(e) and (f)].
GRAPHS: Figure 4. Distributions of p[sub plug](X) (a) and p[sub ppost](X) (b) in Example 5 for theta = .2, .5, .9and n = 100.
Agresti, A. (1992), "A Survey of Exact Inference for Contingency Tables" (with discussion), Statistical Science, 7, 131-177.
Aitkin, M. (1991), "Posterior Bayes Factors" (with discussion), Journal of the Royal Statistical Society, Ser. B, 53, 111-142.
Bayarri, M. J., and Berger, J. O. (1997), "Measures of Surprise in Bayesian Analysis," ISDS Discussion Paper 97-46, Duke University.
----- (1999), "Quantifying Surprise in the Data and Model Verification," in Bayesian Statistics 6, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, London: Oxford University Press, pp. 53-82.
Berger, J. O., Boukai, B., and Wang, W. (1997), "Unified Frequentist and Bayesian Testing of Precise Hypotheses," Statistical Science, 12, 133-160.
Berger, J. O., Brown, L. D., and Wolpert, R. L. (1994), "A Unified Conditional Frequentist and Bayesian Test for Fixed and Sequential Simple Hypothesis Testing," The Annals of Statistics, 22, 1787-1807.
Berger, J. O., and Delampady, M. (1987), "Testing Precise Hypotheses" (with discussion), Statistical Science, 2, 317-352.
Berger, J. O., and Sellke, T. (1987), "Testing a Point Null Hypothesis: The Irreconciability of p Values and Evidence," Journal of the American Statistical Association, 82, 112-122.
Berger, R. L., and Boos, D. D. (1994), "P Values Maximized Over a Confidence Set for the Nuisance Parameter," Journal of the American Statistical Association, 89, 1012-1016.
Blyth, C. R., and Staudte, R. G. (1995), "Estimating Statistical Hypotheses," Statistics and Probability Letters, 23, 45-52.
Box, G. E. P. (1980), "Sampling and Bayes Inference in Scientific Modeling and Robustness," Journal of the Royal Statistical Society, Ser. A, 143, 383-430.
Carlin, B. P. (1999), Discussion of "Quantifying Surprise in the Data and Model Verification," by M. J. Bayarri and J. O. Berger, in Bayesian Statistics 6, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, London: Oxford University Press, pp. 73-74.
De la Horra, J., and Rodriguez-Bernal, M. T. (1997), "Asymptotic Behaviour of the Posterior Predictive P-Value," Communications in Statistics. Part A--Theory and Methods, 26, 2689-2699.
Delampady, M., and Berger, J. O. (1990), "Lower Bounds on Bayes Factors for Multinomial Distributions, With Applications to Chi-Squared Tests of Fit," The Annals of Statistics, 18, 1295-1316.
Edwards, W., Lindman, H., and Savage, L. J. (1963), "Bayesian Statistical Inference for Psychological Research," Psychological Review, 70, 193-242.
Evans, M. (1997), "Bayesian Inference Procedures Derived via the Concept of Relative Surprise," Communications in Statistics, Part A--Theory and Methods, 26, 1125-1143.
Gelfand, A. E., Dey, D. K., and Chang, H. (1992), "Model Determination Using Predictive Distributions With Implementation via Sampling-Based Methods," in Bayesian Statistics 4, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, London: Oxford University Press, pp. 147-167.
Gelman, A., Carlin, J. B., Stern, H., and Rubin, D. B. (1995). Bayesian Data Analysis, London: Chapman and Hall.
Gelman, A., Meng, X. L., and Stern, H. (1996), "Posterior Predictive Assessment of Model Fitness via Realized Discrepancies" (with discussion), Statistica Sinica, 6, 733-807.
Guttman, I. (1967), "The Use of the Concept of a Future Observation in Goodness-of-Fit Problems," Journal of the Royal Statistical Society, Ser. B, 29, 83-100.
Hwang, J. T., Casella, G., Robert, C., Wells, M., and Farrell, R. (1992), "Estimation of Accuracy of Testing," The Annals of Statistics, 20, 490-509.
Hwang, J. T., and Pemantle, R. (1997), "Estimating the Truth Indicator Function of a Statistical Hypothesis Under a Class of Proper Loss Functions," Statistics and Decisions, 15, 103-128.
Hwang, J. T., and Yang, M.-C. (1997), "Evaluate the P-Values for Testing the Independence in 2 x 2 Contingency Tables Using the Estimated Truth Approach--One Way to Resolve the Controversy Relating to Fisher's Exact Test," technical report, Cornell University, Dept. of Statistical Science.
Meng, X. L. (1994), "Posterior Predictive P-Values," The Annals of Statistics, 22, 1142-1160.
Pauler, D. K. (1999), Discussion of "Quantifying Surprise in the Data and Model Verification," by M. J. Bayarri and J. O. Berger, in Bayesian Statistics 6, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, London: Oxford University Press, pp. 70-72.
Robins, J. M. (1999), Discussion of "Quantifying Surprise in the Data and Model Verification," by M. J. Bayarri and J. O. Berger, in Bayesian Statistics 6, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, London: Oxford University Press, pp. 67-70.
Robins, J. M., van der Vaart, A., and Ventura, V. (2000), "The Asymptotic Distribution of p Values in Composite Null Models," Journal of the American Statistical Association, 95, 1143-1156.
Rubin, D. B. (1984), "Bayesianly Justifiable and Relevant Frequency Calculations for the Applied Statistician," The Annals of Statistics, 12, 1151-1172.
----- (1996), Discussion of "Posterior Predictive Assessment of Model Fitness via Realized Discrepancies," by A. Gelman, X. Meng, and H. S. Stern, Statistica Sinica, 6, 787-792.
Schaafsma, W., Tolboom, J., and Van Der Meulen, E. A. (1989), "Discussing Truth or Falsity by Computing a Q-Value," in Statistical Data Analysis and Inference, ed. Y. Dodge, Amsterdam: North-Holland, pp. 85-100.
Sellke, T., Bayarri, M. J., and Berger, J. O. (1999), "Calibration of P Values for Precise Null Hypotheses," ISDS Discussion Paper 99-13, Duke University.
Suissa, S., and Shuster, J. J. (1985), "Exact Unconditional Sample Sizes for the 2 x 2 Binomial Trial," Journal of the Royal Statistical Society, Ser. A, 148, 317-327.
Thompson, P. (1997), "Bayes P-Values," Statistics and Probability Letters, 31, 267-271.
Tsui, K.-W., and Weerahandi, S. (1989), "Generalized P Values in Significance Testing of Hypotheses in the Presence of Nuisance Parameters," Journal of the American Statistical Association, 84, 602-607.
Details for Theorem 1
Suppose that pi(theta) is improper but that there exists a sequence of increasing compact sets THETA[sub k] subset THETA such that union[sub k is greater than or equal to 1] THETA[sub k] = THETA, 0 < m[sub k] = Integral of[sub THETA[sub k]] pi(theta) d theta < Infinity, 0 < m(u) = Integral of[sub THETA] f(u; theta)pi(theta)d theta < Infinity, and
(A.1) \lim_{k \to \infty} m_k \int \frac{[m_k(u)]^2}{m(u)}\, du = 1,
where m[sub k](u) = (Integral of[sub THETA[sub k]] f(u; theta), pi(theta) d theta)/m[sub k]. Then the conclusion of Theorem 1 holds.
Proof. Define h(u, theta) = Pr(p(X) less than or equal to alpha|u; theta). From the definition of p[sub cpred](u), it follows that
(A.2) Integral of h(u, theta)pi(theta|u)d theta = alpha.
By the assumption that p(X) has a distribution that does not depend on theta, it follows that for some constant c,
(A.3) Integral of h(u, theta)f(u; theta)du = Pr(p(X) less than or equal to alpha; theta) = c.
It is immediate from (A.2) and (A.3) that
(A.4) Integral of h(u, theta)pi(theta|u)m[sub k](u)du d theta = alpha
and
(A.5) 1/m[sub k] Integral of h(u, theta)f(u; theta)pi(theta)1[sub THETA[sub k]](theta)d theta du = c.
To prove that alpha = c, completing the proof, we need only show that the difference of the left sides of (A.4) and (A.5) goes to 0 as k Arrow right Infinity. Breaking the left side of (A.4) into integrals over THETA[sup C, sub k] and THETA[sub k], and using the fact that
m[sub k](u) = 1/m[sub k] [m(u) - Integral of[sub THETA[sup C, sub k]] f(u; theta[sup *])pi(theta[sup *])d theta[sup *]]
in the second of these integrals, the difference of the left sides of (A.4) and (A.5) can be written
Integral of[sub THETA[sup C, sub k]] h(u, theta)pi(theta|u)m[sub k](u)d theta du - 1/m[sub k] Integral of[sub THETA[sub k]] h(u, theta)pi(theta|u)
x [Integral of[sub THETA[sup C, sub k]] f(u; theta[sup *])pi(theta[sup *])d theta[sup *]]d theta du.
Because h(u, theta) less than or equal to 1, algebra shows that each of these terms is bounded in absolute value by
Integral of[sub THETA[sup C, sub k]] pi(theta|u)m[sub k](u)d theta du = 1 - Integral of[sub THETA[sub k]] pi(theta|u)m[sub k](u)d theta du
= 1 - m[sub k] Integral of (m[sub k](u))[sup 2]/m(u) du,
which goes to 0 by (A.1) and completes the proof.
Verification of (A.1) When U has a Location or Scale Distribution. For convenience, we assume that U has a location distribution with range R and that THETA = R. Other cases can be handled similarly. Write f(u; theta) = g(u - theta), let G(.) denote the cdf corresponding to g(.), and choose THETA[sub k] = (-k, k). Then m(u) = Integral of g(u - theta)d theta = 1, m[sub k] = Integral of[sup k, sub -k](1)d theta = 2k, m[sub k](u) = (1/2k) Integral of[sup k, sub -k] g(u - theta)d theta = (1/2k)[G(u + k) - G(u - k)], and (A.1) becomes
(A.6) \lim_{k \to \infty} 2k \int [m_k(u)]^2\, du = 1.
Note first that 2km[sub k](u) = [G(u + k) - G(u - k)] less than or equal to 1, so that (A.6) is trivially bounded above by 1. To establish a suitable lower bound, note that [G(log k) - G(- log k)] > (1 - epsilon) for any given epsilon > 0 and sufficiently large k, so that
2k Integral of (m[sub k](u))[sup 2] du is greater than or equal to 1/2k Integral of[sup k - log k, sub -k + log k] [G(u + k) - G(u - k)][sup 2] du
> (1 - epsilon)[sup 2] 2(k - log k)/2k.
Because epsilon was arbitrary, (A.6) is clearly satisfied.
Verification of (19)
Lemma A.1. Let W = (W[sub 1], W[sub 2], ..., W[sub n]) be a random vector with uniform distribution on the simplex SIGMA[sup n, sub i=1] W[sub i] = 1, W[sub i] is greater than or equal to 0, and let W[sub (1)] = min{W[sub i]}. Then Pr(W[sub (1)] less than or equal to c) = 1 - (1 - nc)[sup n - 1] for 0 less than or equal to c less than or equal to 1/n.
Proof. We give a geometric argument. The probability to be computed is
(A.7) Pr(W[sub (1)] less than or equal to c) = 1 - Pr(all W[sub i] > c) = 1 - q.
Note that the conditional distribution of W on the set {W: W[sub i] > c for all i} is also uniform, and that this set is itself a simplex of the same shape as the original simplex but with "corners" (c,..., c, 1 - (n - 1)c), (c,..., 1 - (n - 1)c, c),... (1 - (n - 1)c,...,c,c). The edges of the original simplex have length Square root of 2, whereas those of the smaller simplex have length Square root of 2(1 - nc). It follows that q in (A.7) is given by
(Square root of 2(1 - nc)/Square root of 2)[sup n - 1] = (1 - nc)[sup n - 1],
and the lemma follows.
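The lemma is easily checked by Monte Carlo: a uniform draw from the simplex can be generated by normalizing i.i.d. Exp(1) variables (a Dirichlet(1, ..., 1) vector). A short Python sketch (names ours), with the formula read as Pr(W[sub (1)] ≤ c), as the proof via (A.7) computes:

```python
import random

def min_coord_cdf(c, n):
    """Lemma A.1 read as a cdf: Pr(W_(1) <= c) = 1 - (1 - n c)^(n - 1)
    for 0 <= c <= 1/n."""
    return 1 - (1 - n * c) ** (n - 1)

def min_coord_cdf_mc(c, n, reps=20000, seed=11):
    """Empirical Pr(W_(1) <= c); a uniform draw from the simplex is a
    vector of i.i.d. Exp(1) variables divided by their sum."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        e = [rng.expovariate(1.0) for _ in range(n)]
        if min(e) / sum(e) <= c:
            hits += 1
    return hits / reps
```

For n = 3 and c = .1, the formula gives 1 - .7^2 = .51, and the empirical frequency agrees to Monte Carlo accuracy.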
To establish (19), note that
Pr(T/S less than or equal to c) = E[sup f(s; lambda)] Pr[sup f(t|s; lambda)] (T/S less than or equal to c).
But given s, the distribution of X[sub 1], X[sub 2],..., X[sub n] is uniform on the set {X : SIGMA[sup n, sub i=1] X[sub i] = s}. Defining W[sub i] = X[sub i]/s, the conditions of Lemma A.1 clearly apply, with W[sub (1)] = T/s, and the result follows.
Computation of p[sub ppost] for Independence in Contingency Tables
From (33) and (34), it is clear that a Monte Carlo importance sampling approximation to p[sub ppost] in (33) is given by
p_{ppost} \approx \frac{\sum_{i=1}^{L} [1 - B(t_{obs} - 1;\, n,\, \theta_i \xi_i)]\, \pi(\theta_i, \xi_i \mid x_{obs} \setminus t_{obs}) / h(\theta_i, \xi_i)}{\sum_{i=1}^{L} \pi(\theta_i, \xi_i \mid x_{obs} \setminus t_{obs}) / h(\theta_i, \xi_i)},
where B(x; n, phi) is the distribution function at x of the Bi(n, phi) distribution, h(theta, xi) is some importance function, and (theta[sub 1], xi[sub 1]), (theta[sub 2], xi[sub 2]),..., (theta[sub L], xi[sub L]) are L random draws from h(.).
Importance functions that have a bounded importance ratio, pi(theta[sub i], xi[sub i]|x[sub obs] \ t[sub obs])/h(theta[sub i],xi[sub i]), and that reasonably approximate the desired distribution are useful for several reasons. First, convergence is typically rapid. Second, an explicit formula for the Monte Carlo variance is then available. Third, the scheme can be readily adapted, via acceptance-rejection, to generate an actual sample from the partial posterior, if desired. The importance function in (35) can be seen to have these properties. In particular, the importance ratio can be computed to be
pi(theta, xi|x[sub obs] \ t[sub obs])/h(theta, xi)
= {c[sub 1]/2 (1 - theta xi)[sup n - x[sup o, sub 11]]/ theta[sup x[sup o, sub 21]](1 - theta)[sup n - x[sup o, sub 11] - x[sup o, sub 21]] (1 - xi)[sup x[sup o, sub 21]]
+ c[sub 2]/2 (1 - theta xi)[sup n - x[sup o, sub 11]]/ (1 - theta)[sup x[sup o, sub 12]]xi[sup x[sup o, sub 12]] (1 - xi)[sup n - x[sup o, sub 11] - x[sup o, sub 12]]}[sup -1],
where
c[sub 1] = GAMMA(n - x[sup o, sub 11] - x[sup o, sub 21] + 2)/ GAMMA(x[sup o, sub 12] + 1)GAMMA(x[sup o, sub 22] + 1)
and
c[sub 2] = GAMMA(n - x[sup o, sub 11] - x[sup o, sub 12] + 2)/ GAMMA(x[sup o, sub 21] + 1)GAMMA(x[sup o, sub 22] + 1).
It is straightforward to show that this is bounded by 2/(c[sub 1] + c[sub 2]), using the inequalities theta(1 - xi) less than or equal to (1 - theta xi) and (1 - theta) less than or equal to (1 - theta xi) judiciously. This importance function works well even for very large values of n and extreme values of theta and xi.
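The bound 2/(c[sub 1] + c[sub 2]) can also be checked numerically over a grid of (theta, xi) values; a short Python sketch (names ours; c1 and c2 are as displayed above):

```python
from math import exp, lgamma

def log_c(a, b):
    # log of Gamma(a + b + 2) / (Gamma(a + 1) Gamma(b + 1))
    return lgamma(a + b + 2) - lgamma(a + 1) - lgamma(b + 1)

def importance_ratio(th, xi, x11, x12, x21, x22):
    """The displayed ratio pi(theta, xi | x_obs \\ t_obs) / h(theta, xi)."""
    n = x11 + x12 + x21 + x22
    c1, c2 = exp(log_c(x12, x22)), exp(log_c(x21, x22))
    term1 = (c1 / 2) * (1 - th * xi) ** (n - x11) / (
        th ** x21 * (1 - th) ** (n - x11 - x21) * (1 - xi) ** x21)
    term2 = (c2 / 2) * (1 - th * xi) ** (n - x11) / (
        (1 - th) ** x12 * xi ** x12 * (1 - xi) ** (n - x11 - x12))
    return 1.0 / (term1 + term2)

def ratio_bounded(x11, x12, x21, x22, grid=50):
    """Check importance_ratio <= 2 / (c1 + c2) on an interior grid."""
    c1, c2 = exp(log_c(x12, x22)), exp(log_c(x21, x22))
    bound = 2.0 / (c1 + c2)
    return all(importance_ratio(i / grid, j / grid, x11, x12, x21, x22)
               <= bound * (1 + 1e-9)
               for i in range(1, grid) for j in range(1, grid))
```

The inequalities theta(1 - xi) ≤ 1 - theta xi and (1 - theta) ≤ 1 - theta xi show that each term in the braces is at least c[sub i]/2, which is exactly what the grid check confirms.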
[Received December 1998. Revised November 1999.]
By M. J. Bayarri and James O. Berger
M. J. Bayarri is Professor of Statistics and O.R., Department of Statistics
and O.R., University of Valencia, 46100 Burjassot, Valencia, Spain (E-mail:
susie.bayarri@uv.es). James O. Berger is Arts and Sciences Professor of
Statistics, Institute of Statistics and Decision Sciences, Duke University,
Durham, NC 27708 (E-mail: berger@stat.duke.edu). This work was supported in part
by National Science Foundation grants DMS-9303556 and DMS-9802261, and by
Ministry of Education and Culture and the Generalitat Valenciana (Spain) grants
PB96-0776 and POST99-01-7. The authors thank George Casella and Martin Tanner
for organizing these contributions to JASA, and the associate editor and three
referees for numerous helpful comments that greatly improved the article.
Title: | Asymptotic Distribution of P Values in Composite Null Models. |
Subject(s): | |
Source: | |
Author(s): | |
Abstract: | We investigate the compatibility of a null model H[sub 0] with the data by calculating a p value; that is, the probability, under H[sub 0], that a given test statistic T exceeds its observed value. When the null model consists of a single distribution, the p value is readily obtained, and it has a uniform distribution under H[sub 0]. On the other hand, when the null model depends on an unknown nuisance parameter theta, one must somehow get rid of theta (e.g., by estimating it) to calculate a p value. Various proposals have been suggested to "remove" theta, each yielding a different candidate p value. But unlike the simple case, these p values typically are not uniformly distributed under the null model. In this article we investigate their asymptotic distribution under H[sub 0]. We show that when the asymptotic mean of the test statistic T depends on theta, the posterior predictive p value of Guttman and Rubin and the plug-in p value are conservative (i.e., their asymptotic distributions are more concentrated around 1/2 than a uniform), with the posterior predictive p value being the more conservative. In contrast, the partial posterior predictive and conditional predictive p values of Bayarri and Berger are asymptotically uniform. Furthermore, we show that the discrepancy p value of Meng and Gelman and colleagues can be conservative, even when the discrepancy measure has mean 0 under the null model. We also describe ways to modify the conservative p values to make their distributions asymptotically uniform. [ABSTRACT FROM AUTHOR] |
AN: | 3851443 |
ISSN: | 0162-1459 |
Database: | Business Source Premier |
KEY WORDS: Asymptotic relative efficiency; Bayesian p values; Bootstrap tests; Goodness of fit; Model checking.
Bayarri and Berger (1999, 2000) proposed two new "Bayesian" p values, the conditional predictive p value and the partial posterior predictive p value. They claimed that for checking the adequacy of a parametric model, these new p values are often superior to the "plug-in" (i.e., parametric bootstrap) p value and to previously proposed "Bayesian" p values: the prior predictive p value of Box (1980), the posterior predictive p value of Guttman (1967) and Rubin (1984), and the discrepancy p value of Gelman, Carlin, Stern, and Rubin (1995), Gelman, Meng, and Stern (1996), and Meng (1994). Their claim of superiority is based on extensive investigations of the small-sample properties of the various candidate p values in specific examples. In this article, we investigate their large-sample properties and find that our asymptotic results indeed confirm the superiority of the conditional predictive and partial posterior predictive p values.
In Section 2 we state two theorems that characterize the asymptotic distributions of the candidate p values; their proofs are given in Section 5. In Section 3 we study three examples that vividly illustrate the advantages of Bayarri and Berger's proposals. For certain models, however, the new p values may be difficult to compute, and alternative approaches would be useful. One such approach, discussed in Section 4, is to appropriately modify the test statistic or discrepancy measure to make the plug-in, posterior predictive, and discrepancy p values asymptotically uniform. This approach is particularly successful for the discrepancy p value, in that we derive a test based on a particular discrepancy measure that is both asymptotically uniform and locally most powerful against prespecified alternatives. But this discrepancy can itself be difficult to compute in complex models. The remainder of this section is devoted to a broad overview of the main concerns and results of the article.
1.1 Overview
Suppose that we have observed a realization x[sub obs] of a random variable X. We posit a parametric "null" model, H[sub 0]: f(x; theta), theta is an element of THETA subset R[sup p], for the density of X, and wish to investigate the compatibility of the null model with the observed data x[sub obs]. We do so by comparing the distribution of a given test statistic T = t(X) with its observed value t[sub obs] = t(x[sub obs]), using the p value
(1) p(x[sub obs]) equivalent to Pr[sup m(.)][t(X) > t[sub obs]]
as a measure of compatibility, where m(x) equivalent to m[sub X](x) is a reference density for X and m(t) equivalent to m[sub T](t) the corresponding marginal density of T, and the superscript m(.) signifies that X has density m(x).
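The p value in (1) can be estimated by straightforward Monte Carlo simulation from the reference density m. The sketch below assumes a toy *simple* null (so m is the null density itself): X = (X[sub 1], ..., X[sub n]) iid standard normal and t(X) = |mean(X)|; both choices are ours, purely for illustration.

```python
import random
import statistics

# Toy sketch of the p value in (1): with a simple null (no nuisance
# parameter), take m to be the null density itself and estimate
# p(x_obs) = Pr^m[t(X) > t_obs] by simulating replicate data sets.

def t_stat(x):
    # illustrative test statistic: absolute sample mean
    return abs(statistics.fmean(x))

def mc_pvalue(t_obs, n, draws=4000, seed=1):
    """Monte Carlo estimate of Pr^m[t(X) > t_obs] under X_i iid N(0, 1)."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(draws):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if t_stat(x) > t_obs:
            exceed += 1
    return exceed / draws

# Under this simple H0 the resulting p(X) is (up to Monte Carlo error)
# uniform on [0, 1], as stated in the text.
rng = random.Random(2)
x_obs = [rng.gauss(0.0, 1.0) for _ in range(30)]
p_obs = mc_pvalue(t_stat(x_obs), n=30)
```

The composite-null difficulty discussed next is precisely that no single density is available to play the role of the sampler inside `mc_pvalue`.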
One approach to testing compatibility of the null model f(x; theta) is to embed it into a larger parametric model,
(2a) f(x; psi, theta), (psi, theta) is an element of PSI x THETA,
in which psi = 0 corresponds to the null model H[sub 0]; that is,
(2b) f(x; 0, theta) = f(x; theta), theta is an element of THETA,
whereas psi not equal to 0 characterizes alternatives to H[sub 0]. When f(x; psi, theta) truly represents all alternatives thought likely to be true when H[sub 0] is not true, Bayesian statisticians tend to forgo the use of p values in favor of Bayes factors or a full Bayesian analysis. But when f(x; psi, theta) is simply used to represent alternatives to H[sub 0] that are substantively important to detect, or when no alternative model is specified, many Bayesian statisticians join with their frequentist counterparts and use p values as measures of compatibility.
1.2 Candidate p Values
To calculate the p value (1), a reference density m must be chosen. If the null model consists of a single density f(x; theta), there is universal agreement that m(x) should be f(x; theta). Then p(X) is uniform when H[sub 0] is true, where p(X) denotes the random variable whose observed value is p(x[sub obs]). When the parameter space THETA is not a singleton (i.e., H[sub 0] is composite), one must eliminate the unknown "nuisance" parameter theta to obtain a reference density m in (1). Bayarri and Berger (2000) considered various candidates for m, each resulting in a different candidate p value. For example, the plug-in p value p[sub plug] uses m[sub plug](x|x[sub obs]) = f(x; theta[sub obs]), where theta[sub obs] maximizes f(x[sub obs]; theta); note that we write m(.) in (1) as m[sub plug](.|x[sub obs]) to stress its dependence on the observed data x[sub obs]. The reference densities for p[sub plug] and for other candidate p values based on the statistic t(X) are reported in Table 1. Most of the p values considered by Bayarri and Berger are called "Bayesian" p values, because they assume a (possibly improper) prior density pi(theta) for theta. These include the prior predictive p value p[sub prior] of Box (1980) and the posterior predictive p value p[sub post] of Guttman (1967) and Rubin (1984), which use the prior and posterior predictive densities as references. Bayarri and Berger (1999, 2000) added two new proposals, the partial posterior predictive p value (p[sub ppost]) and the conditional predictive p value (p[sub cpred]). We also study two additional candidate p values that were not considered by Bayarri and Berger (2000): the conditional plug-in p value p[sub cplug], which uses the maximizer theta[sub cMLE,obs] of the conditional likelihood f(x[sub obs]|t[sub obs]; theta) as a plug-in, and the discrepancy p value p[sub dis] of Gelman et al. (1995), Gelman et al.
(1996), and Meng (1994), which replaces the test statistic t(X) by a discrepancy t(X, theta), a function of the data X and of the parameter theta, so that
p[sub dis] = p[sub dis](x[sub obs]) = Pr[sup m[sub dis](.)][t(X, theta) > t(x[sub obs], theta)],
with m[sub dis](x, theta|x[sub obs]) = f(x; theta)pi[sub post](theta|x[sub obs]).
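As a hedged illustration of two of these reference densities (a toy example of ours, not one from the article), the sketch below computes the plug-in and posterior predictive p values for the composite null X[sub 1], ..., X[sub n] iid N(theta, 1) with theta unknown, statistic t(X) = max X[sub i], and a flat prior on theta (so the posterior is N(xbar, 1/n)).

```python
import random
import statistics

# Illustrative composite null: X_i iid N(theta, 1), theta unknown.
#   plug-in:              simulate replicates from f(x; theta_hat), the MLE fit
#   posterior predictive: draw theta from its posterior N(xbar, 1/n) (flat
#                         prior), then a replicate data set from f(x; theta)

def pvalues(x_obs, draws=4000, seed=3):
    rng = random.Random(seed)
    n = len(x_obs)
    theta_hat = statistics.fmean(x_obs)   # MLE of theta under the null
    t_obs = max(x_obs)                    # illustrative statistic t(X)
    exceed_plug = 0
    exceed_post = 0
    for _ in range(draws):
        # plug-in replicate
        x_rep = [rng.gauss(theta_hat, 1.0) for _ in range(n)]
        exceed_plug += max(x_rep) > t_obs
        # posterior predictive replicate
        theta = rng.gauss(theta_hat, (1.0 / n) ** 0.5)
        x_rep = [rng.gauss(theta, 1.0) for _ in range(n)]
        exceed_post += max(x_rep) > t_obs
    return exceed_plug / draws, exceed_post / draws

rng = random.Random(4)
x_obs = [rng.gauss(0.0, 1.0) for _ in range(50)]
p_plug, p_post = pvalues(x_obs)
```

Both candidates use the observed data twice (once to fit theta, once to judge surprise), which is the source of the conservativeness analyzed later in the article.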
1.3 Desirable Sampling Properties of Candidate p Values
First, we present some terminology. We call the random variable p(X) a candidate p value if it has range [0, 1]; if it is also uniform under H[sub 0], then we say that p(X) is a frequentist p value. When a candidate p value is not uniform, we say that it is conservative (anticonservative) at theta if Pr[p(X) < t] is less (greater) than t for all t < 1/2 when H[sub 0]: X is similar to f(x; theta), theta is an element of THETA, is true. Finally, a candidate p value is globally conservative (anticonservative) if it is conservative (anticonservative) for all theta is an element of THETA.
This terminology was motivated by the following considerations. All candidate p values in Table 1 have range [0, 1], but because H[sub 0] is composite, they may not be uniformly distributed, even when H[sub 0] is true. Yet in practice, we use small values of p(x[sub obs]) to denote surprise or incompatibility because, in analogy with the noncomposite case, we act as if p(X) was U[0, 1] under H[sub 0]. Seriously anticonservative candidate p values may cause us to discard the null model even when it is quite compatible with the data, whereas seriously conservative candidates may cause us to fail to discard models that are grossly incompatible with the data. Examples are given in Section 3.
The essential point is that a p value is useful for assessing compatibility of the null model with the data only if its distribution under the null model is known to the analyst; otherwise, the analyst has no way of assessing whether or not observing p = .25, say, is surprising, were the null model true. That we specify that distribution to be uniform is largely a matter of convention. A useful analogy is as follows. It is a matter of convention whether temperatures are reported on the centigrade versus the Fahrenheit scale; however, if we are told that the temperature is 30 degrees, then it is essential that we are also told the scale if we are to know whether to plan to go swimming or skiing.
Hence, for frequentist testing purposes, we should require that candidate p values be frequentist p values. This requirement is generally unfulfillable, with the exception of special models, many of which were discussed by Bayarri and Berger (2000), but often can be approximately satisfied in large samples, particularly when, as we assume, the data X arise from n mutually independent random variables. Then m in (1) can be chosen so that p(X) is an asymptotic frequentist p value; that is, one whose distribution converges in law to a U[0, 1] distribution under H[sub 0]: X is similar to f(x; theta) for all theta is an element of THETA, as n arrow right infinity.
We next argue that Bayesian statisticians who use p values to assess the compatibility of a model with the data should require them to be asymptotic frequentist p values. For if the goal is to check the model rather than the prior, then any procedure should perform adequately whatever the prior, including point-mass priors. This would imply that p values should be required to be frequentist p values, a requirement which, as mentioned earlier, usually cannot be fulfilled. But because, as the sample size increases, the data dominate any prior with support on all of THETA, Bayesians should both expect and require that any model checking procedure perform adequately in the limit as n arrow right infinity. Of course, not all Bayesians would agree with this argument; Bayarri and Berger (1999), Box (1980), Evans (1997), and Meng (1994) have provided some alternate viewpoints.
1.4 Centering of Test Statistics
The following discussion applies to all of our candidate p values except the discrepancy. In most statistics texts, discussions of the asymptotic distribution of tests of fit for a null model H[sub 0]: f(x; theta), theta is an element of THETA restrict attention to statistics t(X) such as the score, likelihood ratio, or Wald test of the hypothesis psi = 0 in a larger model (2a)-(2b), or general chi-squared goodness-of-fit statistics, which are asymptotically pivotal with distribution F, often a chi-squared or a standard normal distribution independent of theta is an element of THETA. Then m[sub T](t) in (1) is the density of T = t(X) corresponding to F, which does not depend on the observed data x[sub obs]. In contrast, in the Bayesian p value and parametric bootstrap literature, the limiting distribution of t(X) often depends on theta and the reference density m[sub T](t|x[sub obs]) depends on x[sub obs], although in the bootstrap context attention is generally restricted to statistics t(X) whose asymptotic mean is independent of theta. We show in Theorem 1 that under regularity conditions, all of the aforementioned candidate p values, with the exception of the prior predictive p value, are asymptotic frequentist p values when the asymptotic mean of t(X) does not depend on theta.
Remark 1. We have not yet discussed the most common definition of a frequentist p value,
p[sub sup](x[sub obs]) = sup[sub theta is an element of THETA] Pr[sup f(.; theta)][t(X) > t[sub obs]].
The p value p[sub sup](X), like p[sub prior](X), need not be an asymptotic frequentist p value if the limiting distribution of t(X) depends on theta, even if the asymptotic mean of t(X) does not vary with theta. For this reason, we do not consider either p[sub sup](X) or p[sub prior](X) further.
Many of the test statistics considered in the Bayesian p value literature have asymptotic means that depend on the parameter theta; three examples illustrate this in Section 3. In the remainder of the article, we study the consequences of allowing the asymptotic mean of t(X) to depend on theta. We restrict attention to statistics t(X) with a normal limiting distribution. Extensions to statistics with limiting chi-squared or folded normal distributions are immediate, and asymptotic results for statistics with other limiting distributions will be pursued elsewhere.
If the asymptotic mean of t(X) varies with theta, then the plug-in and posterior predictive p values will be conservative even as n arrow right infinity, with the former always the less conservative, whereas the conditional plug-in p value p[sub cplug] will be anticonservative. In contrast, under regularity conditions, the partial posterior predictive and the conditional predictive p values are asymptotic frequentist p values. Further, the asymptotic power of the nominal alpha-level test based on p[sub post] against local Pitman alternatives is always less than the power of the test based on p[sub plug], itself less than the power of the tests based on p[sub ppost] and p[sub cpred]. In fact, we show that in certain examples, the asymptotic relative efficiency (ARE) of the partial posterior predictive or conditional predictive test compared to a locally efficient likelihood ratio or score test is 1, whereas the ARE of the posterior predictive test can be 10[sup -2] or less, and, consequently, its power much less than the nominal alpha-level; see Section 3 for examples.
In the proof of Corollary 3 in Section 5, we show that the posterior predictive and plug-in p values are conservative when the maximum likelihood estimator (MLE) theta and t(X) are asymptotically correlated, regardless of the sign of the correlation. Furthermore, we show that theta and t(X) will be asymptotically correlated whenever the asymptotic mean of t(X) depends on theta. Thus, as pointed out by Bayarri and Berger (1999, 2000) and Evans (1997), the problem with these p values is that t(X) is effectively used twice: first to estimate theta, and again to assess lack of fit. In contrast, the conditional MLE, theta[sub cMLE], and t(X) are always asymptotically uncorrelated, which is why p[sub ppost], p[sub cpred], and p[sub cplug] are not conservative. But although p[sub ppost] and p[sub cpred] are asymptotically uniform, p[sub cplug] is anticonservative, because it fails to properly account for the variability of theta[sub cMLE]. In proposing p[sub ppost] and p[sub cpred], both motivated through Bayesian arguments, Bayarri and Berger have solved the "frequentist" math problem of finding a reference density m(.|x[sub obs]) such that the p value (1), based on an arbitrary statistic t(X) with a limiting normal distribution, is an asymptotic frequentist p value. What is curious is that the obvious frequentist guesses, m[sub plug](.|x[sub obs]) and m[sub cplug](.|x[sub obs]), fail.
The preceding considerations do not apply to the discrepancy p value. Specifically, we show that p[sub dis] can be seriously conservative even when the discrepancy t(X, theta) has asymptotic mean 0 under f(x; theta) for all theta is an element of THETA.
To formally study the large-sample properties of our candidate p values, we consider the following canonical setup. At sample size n, the data are X equivalent to X[sub n] = (X[sub 1], ..., X[sub n]), where the X[sub i] are mutually independent random variables, each following a parametric model f[sub i](x; psi[sub n], theta) with psi[sub n] is an element of PSI subset R[sup 1] and theta is an element of THETA subset R[sup p]. Thus the likelihood is
f(x; psi[sub n], theta) = PRODUCT[sup n, sub i = 1] f[sub i](x[sub i]; psi[sub n], theta).
Unlike Robins (1999), we do not assume the X[sub i] to be identically distributed to allow for regression models in which the regressors are regarded as fixed constants, as in Example A of Section 3. The subscript n in psi[sub n] indicates that the unidimensional nuisance parameter is allowed to vary with n; that is, psi[sub n] = 0 for all n under H[sub 0], and psi[sub n] = k[sub n]/Square root of n under local Pitman alternatives, where k[sub n] arrow right k is an element of R[sup 1] as n arrow right infinity. Note that when psi[sub n] = 0, we frequently write the data model f(x; psi[sub n], theta) = f(x; 0, theta) more simply as f(x; theta) and, for notational convenience, often suppress the subscript n denoting sample size in quantities such as X equivalent to X[sub n].
Attention is restricted to univariate test statistics t(X) that are asymptotically normal with asymptotic mean nu[sub n](k[sub n]/square root of n, theta) and asymptotic variance sigma[sup 2](theta) under the null and local alternatives; that is, we assume that when X is similar to f(x; k[sub n]/square root of n, theta),
(3a) n[sup 1/2][t(X) - nu[sub n](k[sub n]/square root of n, theta)]/sigma(theta) arrow right N(0, 1),
where arrow right denotes convergence in distribution. Note that because local alternatives are contiguous (van der Vaart 1998, chap. 6), the asymptotic variance sigma[sup 2](theta) does not depend on k. We also assume that (3a) holds for sequences theta = theta[sub 0] + k[sup *]/square root of n for any fixed theta[sub 0] and k[sup *].
We further assume that the functions nu[sub n] are continuously differentiable in a neighborhood of (0, theta), with partial derivatives converging to limits as n arrow right infinity. Thus
(3b) nu[sub psi](theta) = lim[sub n arrow right infinity] differential nu[sub n](psi, theta)/differential psi|[sub psi = 0]
and
(3c) nu[sub theta](theta) = lim[sub n arrow right infinity] differential nu[sub n](0, theta)/differential theta
both exist. Note that nu[sub theta](theta) is a p vector that is nonzero only when the asymptotic mean nu[sub n](theta) = nu[sub n](0, theta) of t(X) under H[sub 0] depends on theta. In Theorem 3, we prove that under mild additional conditions, nu[sub psi](theta) and nu[sub theta](theta) are equal to the asymptotic covariances of n[sup 1/2][t(X) - nu[sub n](theta)] with n[sup -1/2]S[sub psi](theta) = n[sup -1/2]differential log f(X; psi, theta)/differential psi|[sub psi = 0] and n[sup -1/2]S[sub theta](theta) = n[sup -1/2]differential log f(X; 0, theta)/differential theta, where S[sub psi](theta) and S[sub theta](theta) are the scores for psi and theta at psi = 0.
We also need to define the scalars
(4) OMEGA(theta) = nu[sub theta](theta)'i[sup -1, sub theta theta](theta)nu[sub theta](theta)
and
(5) omega(theta) = nu[sub psi](theta) - nu[sub theta](theta)'i[sup -1, sub theta theta](theta)i[sub theta psi](theta)
and the noncentrality parameter
(6) NC(theta) = omega(theta)/[sigma[sup 2](theta) - OMEGA(theta)][sup 1/2],
where i[sub theta theta](theta) = lim[sub n arrow right infinity]n[sup -1]E[sub theta][-differential[sup 2]log f(X; 0, theta)/differential theta differential theta'] = lim[sub n arrow right infinity]n[sup -1]E[sub theta][S[sub theta](theta)S[sub theta](theta)'] and i[sub theta psi](theta) = lim[sub n arrow right infinity]n[sup -1]E[sub theta][S[sub psi](theta)S[sub theta](theta)]. Here and elsewhere, E[sub theta] denotes expectation with respect to f(x; theta). Note that OMEGA(theta) and sigma(theta), in contrast with omega(theta) and NC(theta), depend only on the null model f(x; theta).
We are now ready to state our main theorem, which we subsequently interpret in a series of remarks.
Theorem 1. Subject to the assumptions of Theorems 3 and 4, under law f(x; k[sub n]/square root of n, theta), each candidate p value can be written as
p(X) = 1 - PHI(Q) + o[sub p](1),
where o[sub p](1) denotes a random variable converging to 0 in probability, PHI is the standard normal cdf, and Q = q(X) is similar to N(k mu(theta), tau[sup 2](theta)), with mu(theta) = tau(theta)NC(theta). The values of tau[sup 2](theta) for our candidates are as follows:
Plug-in: tau[sup 2, sub plug](theta) = [sigma[sup 2](theta) -OMEGA(theta)]/sigma[sup 2](theta)
Posterior predictive: tau[sup 2, sub post](theta) = [sigma[sup 2](theta) - OMEGA(theta)]/[sigma[sup 2](theta) + OMEGA(theta)]
Partial posterior predictive: tau[sup 2, sub ppost](theta) = 1
Conditional predictive: tau[sup 2, sub cpred](theta) = 1
Conditional plug-in: tau[sup 2, sub cplug](theta) = sigma[sup 2](theta)/[sigma[sup 2](theta) - OMEGA(theta)]
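The tau[sup 2] values of Theorem 1 are easy to evaluate numerically. The sketch below takes the hypothetical values sigma[sup 2](theta) = 1 and a small grid of OMEGA(theta) values (our own inputs, for illustration only) and exhibits the resulting ordering of the candidates.

```python
# Numeric illustration of the tau^2 values in Theorem 1 for hypothetical
# sigma^2(theta) = 1 and OMEGA(theta) > 0.  The ordering
# tau2_post < tau2_plug < 1 < tau2_cplug shows p_post and p_plug conservative
# (p_post more so) and p_cplug anticonservative, while
# tau2_ppost = tau2_cpred = 1 (asymptotic frequentist p values).

def taus(sigma2, Omega):
    """Return (tau2_plug, tau2_post, tau2_ppost, tau2_cpred, tau2_cplug)."""
    tau2_plug = (sigma2 - Omega) / sigma2
    tau2_post = (sigma2 - Omega) / (sigma2 + Omega)
    tau2_ppost = 1.0
    tau2_cpred = 1.0
    tau2_cplug = sigma2 / (sigma2 - Omega)
    return tau2_plug, tau2_post, tau2_ppost, tau2_cpred, tau2_cplug

rows = {Omega: taus(1.0, Omega) for Omega in (0.1, 0.5, 0.9)}
```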
Remark 2: Asymptotic Frequentist p Values. Theorem 1 implies that a candidate p value is an asymptotic frequentist p value under H[sub 0] (i.e., k = 0) if and only if tau[sup 2](theta) = 1. Hence all candidate p values referred to in Theorem 1 are asymptotic frequentist p values when nu[sub theta](theta) = 0, because then OMEGA(theta) in (4) is 0. When nu[sub theta](theta) not equal to 0, with tau[sup 2](theta) < 1 (tau[sup 2](theta) > 1), the p value is conservative (anticonservative). Hence p[sub cplug] is anticonservative, whereas p[sub post] and p[sub plug] are conservative, with p[sub post] being the more conservative because tau[sup 2, sub post](theta) < tau[sup 2, sub plug](theta).
Remark 3: Efficiency. Let chi(alpha) = I[p(X) < alpha] denote the nominal alpha-level test that rejects H[sub 0] whenever chi(alpha) = 1; here I[A] is the indicator function for event A that takes value 1 if A is true and 0 otherwise.
The asymptotic power of test chi(alpha) is
(7) beta(alpha, k, theta) = lim[sub n arrow right infinity]E[sub k/square root of n, theta][chi(alpha)],
where E[sub k/square root of n, theta] refers to expectations under f(x; k/square root of n, theta). The asymptotic representation of p(X) given in Theorem 1 implies that chi(alpha) has asymptotic power beta(alpha, k, theta) = 1 - PHI[z[sub 1-alpha]tau[sup -1](theta) - kNC(theta)]. In model f(x; psi, theta), a locally most powerful asymptotic alpha-level test chi[sub eff](alpha) of the hypothesis psi = 0 has asymptotic power 1 - PHI[z[sub 1-alpha] - kNC[sub eff](theta)], where NC[sub eff](theta) is the efficient noncentrality parameter NC[sub eff](theta) = {i[sub psi psi](theta) - i[sub psi theta](theta)'i[sup -1, sub theta theta](theta)i[sub theta psi](theta)}[sup 1/2], with i[sub psi psi](theta) = lim[sub n arrow right infinity]n[sup -1]E[sub theta][S[sub psi](theta)[sup 2]] (see van der Vaart 1998, chap. 15). The following lemma indicates that a sufficient condition for chi[sub ppost](alpha) and chi[sub cpred](alpha) to be locally most powerful at a particular value theta[sup *] of theta is that t(X) is asymptotically equivalent to an affine transformation of S[sub psi](theta[sup *]), because then NC(theta[sup *]) = NC[sub eff](theta[sup *]); the proof is in Section 5.
Lemma 1. Under regularity conditions, if n[sup 1/2]t(X) = an[sup -1/2]S[sub psi](theta[sup *]) + b + o[sub p](1) under f(x; theta[sup *]), for some constants a and b with a not equal to 0, then NC(theta[sup *]) = NC[sub eff](theta[sup *]).
Remark 4: Actual Asymptotic Level and Relative Power and Efficiency. The asymptotic actual alpha-level of test chi(alpha) is beta(alpha, k, theta) evaluated at k = 0, which we denote by actual(alpha, theta). As a function of alpha, actual(alpha, theta) is the cdf of the asymptotic distribution of p(X) under H[sub 0]. We define the asymptotic relative power (ARP) of test chi(alpha), denoted by ARP = ARP(alpha, beta, theta), as its asymptotic power under alternative f(x; k[sub ppost]/square root of n, theta), with k[sub ppost] chosen such that the asymptotic power of chi[sub ppost](alpha) is beta; that is, with k[sub ppost] such that beta[sub ppost](alpha, k[sub ppost], theta) = beta. Finally, the asymptotic relative efficiency ARE = ARE (alpha, beta, theta) of candidate test chi(alpha) is the limit, as n[sub cand] arrow right infinity, of the ratio n[sub ppost]/n[sub cand], where n[sub ppost] and n[sub cand] are the sample sizes needed for tests chi[sub ppost](alpha) and candidate test chi(alpha) to both have power beta under the alternative f(x; k/square root of n, theta).
Corollary 1. For each candidate p(X) of Theorem 1, the actual asymptotic alpha-level of the test chi(alpha) is
actual(alpha, theta) = 1 - PHI[z[sub 1-alpha]tau[sup -1](theta)].
If nu[sub psi](theta) not equal to 0, then the ARP and ARE are
ARP(alpha, beta, theta) = 1 - PHI[-z[sub beta] + z[sub 1-alpha](tau[sup -1](theta) - 1)]
and
ARE(alpha, beta, theta) = (1 + z[sub 1-alpha]/z[sub beta])[sup 2]/ (1 + tau[sup -1](theta)z[sub 1-alpha]/z[sub beta])[sup 2].
Note that neither ARP(alpha, beta, theta) nor ARE(alpha, beta, theta) depends on NC(theta), and so these two quantities are the same for all local alternatives nesting the model f(x; theta). When tau(theta) < 1/3, ARP(alpha, 1 - alpha, theta) < alpha, so the asymptotic local power of the test chi(alpha) is less than alpha, even though the test chi[sub ppost](alpha) has asymptotic power beta = 1 - alpha.
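The quantities in Corollary 1 can be coded directly from the displayed formulas, as functions of a hypothetical tau(theta) (the only input needed, since neither ARP nor ARE depends on NC(theta)).

```python
from statistics import NormalDist

# Corollary 1 evaluated numerically: actual asymptotic alpha-level, ARP, and
# ARE as functions of a hypothetical tau = tau(theta).  PHI is the standard
# normal cdf and z(p) its quantile function.

PHI = NormalDist().cdf
z = NormalDist().inv_cdf

def actual(alpha, tau):
    # actual(alpha, theta) = 1 - PHI[z_{1-alpha} tau^{-1}(theta)]
    return 1.0 - PHI(z(1.0 - alpha) / tau)

def arp(alpha, beta, tau):
    # ARP = 1 - PHI[-z_beta + z_{1-alpha}(tau^{-1} - 1)]
    return 1.0 - PHI(-z(beta) + z(1.0 - alpha) * (1.0 / tau - 1.0))

def are(alpha, beta, tau):
    # ARE = (1 + z_{1-alpha}/z_beta)^2 / (1 + tau^{-1} z_{1-alpha}/z_beta)^2
    r = z(1.0 - alpha) / z(beta)
    return (1.0 + r) ** 2 / (1.0 + r / tau) ** 2

# tau = 1 recovers the nominal level and power; tau < 1 is conservative.
level_conservative = actual(0.05, 0.5)   # well below the nominal .05
level_anticons = actual(0.05, 2.0)       # above the nominal .05
```

Plugging in small tau also reproduces the closing observation above: once tau(theta) drops below 1/3, `arp(alpha, 1 - alpha, tau)` falls below alpha.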
Remark 5: Discrepancy p Values. In Section 5 we prove that the discrepancy p value and the posterior predictive p value are related as follows.
Theorem 2. Let t(X, theta) be a discrepancy measure and, for a given fixed theta[sub 0], let t(X) = t(X, theta[sub 0]). Then, under f(x; k[sub n]/square root of n, theta[sub 0]), p[sub dis](X) based on t(X, theta) and p[sub post](X) based on t(X) have the same limiting distribution.
Moreover, if we redefine sigma[sup 2](theta) to be the asymptotic variance of n[sup 1/2]t(X, theta) under f(x; 0, theta), nu[sub theta](theta) = lim[sub n arrow right infinity] differential nu[sub n](0, theta[sup *]; theta)/differential theta[sup *]|[sub theta[sup *] = theta], and nu[sub psi](theta) = lim[sub n arrow right infinity] differential nu[sub n](psi, theta; theta)/differential psi|[sub psi = 0], where nu[sub n](psi, theta[sup *]; theta) is the asymptotic mean of n[sup 1/2]t(X, theta) under f(x; psi, theta[sup *]), then Theorem 1 holds for p[sub dis](X) with tau[sup 2, sub dis] = tau[sup 2, sub post].
In this section we use Theorems 1 and 2 and Corollary 1 to compare the asymptotic properties of our candidate p values in three examples. We first report results, and then give their derivation.
3.1 Example A
Suppose that X = (X[sub 1], ..., X[sub n]), with the X[sub i]'s being mutually independent. Consider the null model X[sub i] is similar to N(gamma v[sub i], c[sup 2]), theta = (gamma, c[sup 2])', and v = (v[sub 1], ..., v[sub n])' a vector of known constants. Let the test statistic be t(X) = n[sup -1]SIGMA[sub i]X[sub i]w[sub i] = n[sup -1]X'w, where w = (w[sub 1], ..., w[sub n])' is another vector of known constants. We assume that v'v = w'w = n and v'1 = w'1 = 0, where 1 is the n-vector of 1's. Then the mean of t(X), E[sub theta](t(X)) = rho gamma, depends on theta = (gamma, c[sup 2])' whenever rho not equal to 0, where rho = n[sup -1]SIGMA[sub i]w[sub i]v[sub i] = n[sup -1]w'v. Figure 1 shows the asymptotic cdf's actual(alpha) = actual(alpha, theta) of various candidate p values under the null model, for several choices of rho. These depend on w and v only through rho, and they do not depend on the value of theta generating the data. As expected, the plug-in and posterior predictive p values are conservative when rho not equal to 0, with the plug-in the less conservative. Indeed, both are converging to a point mass at 1/2 as the empirical correlation rho of the regressors approaches 1. In contrast, p[sub cplug] is anticonservative. In particular, as rho arrow right 1, actual[sub cplug](alpha, theta) converges to 1/2 for all alpha < 1/2.
Let the alternative model be N(psi w[sub i] + gamma v[sub i], c[sup 2]). Then the score S[sub psi](theta) = n(t(X) - rho gamma)/c[sup 2] is an affine transformation of t(X), and so chi[sub ppost](alpha) is locally most powerful for testing psi = 0. Figure 2 displays actual(alpha, theta), ARP(alpha, beta, theta), and ARE(alpha, beta, theta) as functions of rho, with alpha = .05 and beta = .80, for several candidate p values. These functions do not depend on the value of theta = (gamma, c[sup 2])' generating the data. Figure 2(b) shows that when rho not equal to 0, the powers of both the plug-in and posterior predictive tests are less than .80, with the latter the smaller; and as rho arrow right 1, both power functions fall below the nominal alpha-level of .05 as they converge to 0.
What is disturbing is that the performances of both the plug-in and posterior predictive p values and tests depend on the "correlation" rho of the regressors, which is ancillary. For example, suppose that p[sub post] = .25 was reported by the data analyst. If we knew that rho = 0, then from inspection of Figures 1 and 2 we could conclude that the data and model appear compatible, but we would reach the opposite conclusion if rho = .99. In our previous weather analogy, rho would be the temperature scale, p[sub post] the temperature, and the decision whether to use or discard the null model the decision whether to go swimming or skiing. However, it is rare for rho to even be reported by the analyst, in which case reaching an appropriate decision would be impossible. Yet even if rho were reported, p[sub post] would not be interpretable by the consumer as either compatible or incompatible with the data without the benefit of the additional detailed mathematical analysis that we used to create the plots in Figures 1 and 2. Such analysis is beyond the capabilities of most consumers of statistical reports. It would be as if an American schoolchild were told that the temperature was 30 degrees Celsius but had never been taught the centigrade scale.
Consider now the discrepancy p value based on the average score t(X, theta) = n[sup -1]S[sub psi](theta) = (c[sup 2]n)[sup -1]SIGMA[sub i](X[sub i] - gamma v[sub i])w[sub i]. Note that the mean of t(X, theta) under the null model, X[sub i] is similar to N(gamma v[sub i], c[sup 2]), is 0 for all theta. Nonetheless, because t(X) > t(x[sub obs]) iff t(X, theta) > t(x[sub obs], theta), it follows that p[sub dis](X) = p[sub post](X) with probability 1 under both the null and alternative models. Thus the curves for p[sub dis](X) and p[sub post](X) in Figures 1 and 2 are indistinguishable. And hence when rho not equal to 0, the discrepancy p value is conservative, and the corresponding test is inefficient. This result can also be derived directly from Theorem 2 or Corollary 2 below.
The curves in Figures 1 and 2 remain unchanged under the submodel in which the variance c[sup 2] is known and thus theta = gamma. Results of Bayarri and Berger (2000) imply that under this submodel, the cdf's, ARPs and AREs in Figures 1 and 2 are exact at each sample size n, under the noninformative prior pi(gamma) proportional to 1.
3.2 Example B
Stigler (1977) provided data on Simon Newcomb's n = 66 measurements for estimating the speed of light, with each measurement X[sub i] recorded as a deviation from 24,800 nanoseconds. Gelman et al. (1995, sec. 2.2) modeled these data as n iid draws from a N(mu, c[sup 2]) distribution, with a noninformative uniform prior on (mu, log c). To look for incompatibility of the data with the N(mu, c[sup 2]) model in the left tail of the distribution, they computed a posterior predictive p value with t(X) = min{X[sub i]; i = 1,..., n}, the first-order statistic (Gelman et al. 1995, p. 166). A reasonable alternative choice for T = t(X) would be the empirical qth quantile of X, T = Z[sub q] = sup{t; n[sup -1]SIGMA[sub i]I(X[sub i] is less than or equal to t) < q}, for a small value of q (say, q = .05), which we use in place of the first-order statistic because, in contrast to the latter, it is asymptotically normal and covered by our large-sample theory. The asymptotic mean of T = Z[sub q] under the normal null model is the population qth quantile of a N(mu, c[sup 2]) distribution; that is, z[sub q](theta) = z[sub q](mu, c[sup 2]) = z[sub q]c + mu, with z[sub q] = z[sub q](0, 1), which depends on theta = (mu, c[sup 2]). Hence we expect both the plug-in and posterior predictive p values to be asymptotically conservative. Figure 3 shows the curves actual(.05, theta), ARP(.05, .8, theta), and ARE(.05, .8, theta) for our candidate p values as a function of q, none of which depends on the value of theta generating the data. Note that as discussed in Corollary 1 we did not have to specify the alternative model f(x; psi, theta) under which the ARP and ARE are calculated.
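The posterior predictive computation for this choice of T can be sketched as follows. This is our illustration, not the authors' code: the actual measurement series is not reproduced here, so a synthetic stand-in sample is used, and Z[sub q] is implemented as the ceil(nq)-th order statistic. Under the prior uniform on (mu, log c), the posterior is the standard normal-inverse-chi-square form used below.

```python
import numpy as np

def post_pred_p_quantile(x, q=0.05, K=4000, seed=1):
    # Posterior predictive p value for T = the ceil(nq)-th order statistic
    # under the N(mu, sigma^2) model with uniform prior on (mu, log sigma).
    # Posterior: sigma^2 ~ (n-1)s^2 / chi^2_{n-1}, mu | sigma^2 ~ N(xbar, sigma^2/n).
    rng = np.random.default_rng(seed)
    n = len(x)
    k = int(np.ceil(q * n))
    t_obs = np.sort(x)[k - 1]
    xbar, s2 = x.mean(), x.var(ddof=1)
    sig2 = (n - 1) * s2 / rng.chisquare(n - 1, size=K)
    mu = rng.normal(xbar, np.sqrt(sig2 / n))
    # One replicated data set per posterior draw; same statistic each time.
    reps = rng.normal(mu[:, None], np.sqrt(sig2)[:, None], size=(K, n))
    t_rep = np.sort(reps, axis=1)[:, k - 1]
    return float(np.mean(t_rep >= t_obs))

# Synthetic stand-in sample (the real measurement series is not reproduced here).
x = np.random.default_rng(2).normal(25.0, 5.0, size=66)
print(post_pred_p_quantile(x))
```

With data actually generated from the null model, the reported p value tends to sit in the middle of (0, 1), more tightly than a uniform p value would, which is the conservativeness the text predicts.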
A natural discrepancy measure generalizing the test statistic t(X) = Z[sub q] is t(X, theta) = Z[sub q] - z[sub q](theta), the difference between the empirical and the true qth quantiles of the null model. Because t(X) > t(x[sub obs]) iff t(X, theta) > t(x[sub obs], theta), p[sub dis](X) and p[sub post](X) are equal with probability 1, and so they have identical distributions.
3.3 Example C
Gelman et al. (1995, pp. 171-172) also analyzed Newcomb's speed of light data using a discrepancy p value based on
t(X, theta) = |Z[sub 1-q] - mu| - |mu - Z[sub q]|
with q = .1, to check whether or not the magnitude of skewness, as measured by t(x[sub obs], theta), was compatible with a N(mu, c[sup 2]) distribution. Note that under the null model N(mu, c[sup 2]), t(X, theta) has asymptotic (and exact) mean 0 for all theta = (mu, c[sup 2]), because the X[sub i] have a symmetric distribution centered at mu. A natural test statistic related to the discrepancy t(X, theta) is t(X) = Z[sub 1-q] + Z[sub q] because, on a set with probability going to 1, t(X, theta) = Z[sub 1-q] + Z[sub q] - 2 mu under the null model and any local alternative. On this set, t(X) > t(x[sub obs]) iff t(X, theta) > t(x[sub obs], theta), and so p[sub dis](X) based on t(X, theta), and p[sub post](X) based on t(X), have the same asymptotic distribution. Figure 4 shows actual(.05, theta), ARP(.05, .8, theta), and ARE(.05, .8, theta) for our candidate p values based on t(X), as functions of q, none of which depends on the value of theta generating the data.
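The discrepancy computation differs from the test-statistic computation only in that the replicated and observed values of t are compared at the same posterior draw of theta. A minimal sketch (ours; synthetic data, with Z[sub q] as the ceil(nq)-th order statistic and the same noninformative-prior posterior as in Example B):

```python
import numpy as np

def discrepancy_p_skew(x, q=0.1, K=4000, seed=3):
    # p_dis for t(X, theta) = |Z_{1-q} - mu| - |mu - Z_q| under N(mu, sigma^2),
    # uniform prior on (mu, log sigma); Z_q is the ceil(nq)-th order statistic.
    rng = np.random.default_rng(seed)
    n = len(x)
    k_lo, k_hi = int(np.ceil(q * n)), int(np.ceil((1 - q) * n))
    xs = np.sort(x)

    def t(sorted_data, mu):
        return abs(sorted_data[k_hi - 1] - mu) - abs(mu - sorted_data[k_lo - 1])

    xbar, s2 = x.mean(), x.var(ddof=1)
    hits = 0
    for _ in range(K):
        sig2 = (n - 1) * s2 / rng.chisquare(n - 1)       # posterior draw of sigma^2
        mu = rng.normal(xbar, np.sqrt(sig2 / n))         # posterior draw of mu
        rep = np.sort(rng.normal(mu, np.sqrt(sig2), n))  # replicate under this theta
        hits += t(rep, mu) >= t(xs, mu)                  # compare at the same theta
    return hits / K

x = np.random.default_rng(4).normal(0.0, 1.0, size=66)
print(discrepancy_p_skew(x))
```

For symmetric null data the p value is non-extreme, consistent with the asymptotic equivalence to the test statistic t(X) = Z[sub 1-q] + Z[sub q] noted in the text.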
3.4 Derivation of the Results
We now show how the quantities sigma(theta), OMEGA(theta), and NC(theta) were obtained for the test statistics and discrepancies in Examples A-C.
Example A. Here theta' = (theta[sub 1], theta[sub 2]) = (gamma, c[sup 2]), X[sub i] is similar to N(psi w[sub i] + gamma v[sub i], c[sup 2]), and t(X) = n[sup -1]X'w. We assume that at each sample size n, the vectors of constants (v, w) = (v[sub n], w[sub n]) are chosen such that n[sup -1]v'v, n[sup -1]w'w, and rho = n[sup -1]w'v do not depend on n. Then the asymptotic mean of t(X) is nu[sub n](psi, theta) = psi + theta[sub 1]rho. Also, nu[sub theta](theta) = (rho, 0)', nu[sub psi](theta) = 1, sigma[sup 2](theta) = theta[sub 2], i[sub theta theta](theta) = diag(theta[sup -1, sub 2], 1/2theta[sup 2, sub 2]), i[sub psi theta](theta)' = (theta[sup -1, sub 2]rho, 0), OMEGA(theta) = rho[sup 2]theta[sub 2], and omega(theta) = 1 - rho[sup 2]. For the discrepancy, t(X, theta) = n[sup -1]S[sub psi](theta) = n[sup -1]theta[sup -1, sub 2]SIGMA[sub i](X[sub i] - theta[sub 1]v[sub i])w[sub i], nu[sub n](psi, theta[sup *]: theta) = E[sub psi, theta[sup *]]{t(X, theta)} = theta[sup -1, sub 2]{psi + (theta[sup *, sub 1] - theta[sub 1])rho}. Hence nu[sub theta](theta) = (rho, 0)'theta[sup -1, sub 2], nu[sub psi](theta) = theta[sup -1, sub 2], sigma[sup 2](theta) = theta[sup -1, sub 2], OMEGA(theta) = theta[sup -1, sub 2]rho[sup 2], and omega(theta) = (1 - rho[sup 2])theta[sup -1, sub 2].
Example B. Here t(X) = Z[sub q], and under f(x; theta), X[sub i] is similar to N(mu, c[sup 2]) with theta' = (theta[sub 1], theta[sub 2]) = (mu, c[sup 2]). Then nu[sub n](0, theta) = theta[sup 1/2, sub 2]z[sub q] + theta[sub 1], and hence nu[sub theta](theta) = (1, theta[sup - 1/2, sub 2]z[sub q]/2)', i[sub theta theta](theta) = diag(1/theta[sub 2], 1/2theta[sup 2, sub 2]), and OMEGA(theta) = theta[sub 2](1 + z[sup 2, sub q]/2). Further, it is well known that if the X[sub i] are iid under any law f(x; theta), then
(4) n[sup 1/2]{Z[sub q] - z[sub q](theta)} = -f(z[sub q](theta); theta)[sup -1]n[sup -1/2]SIGMA[sub i]{I[X[sub i] < z[sub q](theta)] - q} + o[sub p](1).
Hence sigma[sup 2](theta) = f(z[sub q](theta); theta)[sup -2]var[sub theta]{I[X[sub i] < z[sub q](theta)]}, which, for our N(theta[sub 1], theta[sub 2]) model, evaluates to
sigma[sup 2](theta) = theta[sub 2]phi[sup -2](z[sub q])q(1 - q),
where phi is the standard normal density. Further, under f(x; k[sub n]/square root of n, theta), it follows from Theorem 3 that nu[sub psi](theta) = E[sub theta][-f(z[sub q](theta); theta)[sup -1]I(X[sub i] < z[sub q](theta))S[sub psi, i](theta)], where S[sub psi,i](theta) is the contribution of subject i to S[sub psi](theta). For the discrepancy t(X, theta) = Z[sub q] - z[sub q](theta), all of the relevant foregoing quantities are the same as for t(X) = Z[sub q], but with nu[sub n](0, theta[sup *]: theta) = z[sub q](theta[sup *]) - z[sub q](theta) = theta[sup * 1/2, sub 2]z[sub q] + theta[sup *, sub 1] - (theta[sup 1/2, sub 2]z[sub q] + theta[sub 1]).
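The closed-form asymptotic variance sigma[sup 2](theta) = theta[sub 2]phi[sup -2](z[sub q])q(1 - q) is easy to verify by simulation. A minimal sketch (ours), taking theta = (0, 1) so that theta[sub 2] = 1:

```python
import numpy as np
from math import pi, sqrt, exp

q, n, M = 0.05, 2000, 4000
rng = np.random.default_rng(5)
k = int(np.ceil(q * n))
# M sample quantiles Z_q, each from an N(0, 1) sample of size n.
zq_hat = np.sort(rng.standard_normal((M, n)), axis=1)[:, k - 1]
mc_var = n * zq_hat.var()              # empirical variance of sqrt(n) Z_q

z_q = -1.6448536269514722              # standard normal 5% quantile
phi = exp(-z_q**2 / 2.0) / sqrt(2.0 * pi)
theory = q * (1.0 - q) / phi**2        # formula above with theta_2 = 1
print(mc_var, theory)                  # both close to 4.47
```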
Example C. In this example, t(X) = Z[sub q] + Z[sub 1-q], where X[sub i] is similar to N(mu, c[sup 2]), theta' = (theta[sub 1], theta[sub 2]) = (mu, c[sup 2]); so as in Example B, i[sub theta theta](theta) = diag(1/theta[sub 2], 1/2theta[sup 2, sub 2]). It follows from the results of Example B that nu[sub n](0, theta) = theta[sup 1/2, sub 2](z[sub q] + z[sub 1-q]) + 2theta[sub 1], and so that nu[sub theta](theta) = (2, 0)', because z[sub q] + z[sub 1-q] = 0. Thus OMEGA(theta) = 4theta[sub 2]. Further, it follows from (4) that sigma[sup 2](theta) = var[sub theta]{-f(z[sub q](theta); theta)[sup -1]I(X[sub i] < z[sub q](theta)) - f(z[sub 1-q](theta); theta)[sup -1]I(X[sub i] < z[sub 1-q](theta))}, which for our normal model evaluates to sigma[sup 2](theta) = 2q theta[sub 2]phi[sup -2](z[sub q]). All the relevant foregoing quantities are the same for the discrepancy measure t(X, theta) = |Z[sub 1-q] - mu| - |mu - Z[sub q]| as for t(X) = Z[sub q] + Z[sub 1-q], but with nu[sub n](0, theta[sup *]: theta) = 2(theta[sup *, sub 1] - theta[sub 1]).
It follows from Theorems 1 and 2 and Corollary 1 that from an asymptotic frequentist viewpoint, p[sub cpred] and p[sub ppost] are preferred to our other candidate p values when the asymptotic mean of t(X) depends on theta. This result still leaves open the question of why we should want to use a test statistic with nonconstant asymptotic mean. Bayarri and Berger (2000) argued that this is desirable because it does not restrict the choice of possible measures t(X) of departure from the null model; the preferred, or intuitive, choice may happen to have a nonconstant asymptotic mean. So suppose that our choice t(X) satisfies (3a)-(3c), with nu[sub theta](theta) not equal to 0. Now p[sub cpred] and p[sub ppost] are sometimes difficult to compute (Bayarri and Berger 1999, 2000; Pauler 1999), and alternative approaches might be useful. We consider two: the first is to replace t(X) with a closely related test statistic that has a constant asymptotic mean; the second is to adjust (i.e., calibrate) those candidate p values with nonuniform asymptotic distributions. We also describe how to modify a discrepancy measure so that the discrepancy p value is asymptotically uniform.
4.1 Modifications of t(X) and t(X, theta)
One alternative to computing p[sub cpred] or p[sub ppost] based on t(X) is to compute p[sub plug] or p[sub post] based on the centered statistic t[sup ~](X) = t(X) - nu[sub n](theta), where theta is the MLE and nu[sub n](theta) is equivalent to nu[sub n](0, theta), because, by a Taylor expansion, t[sup ~](X) is asymptotically normal with constant asymptotic mean 0 and asymptotic variance
c[sup 2](theta) = sigma[sup 2](theta) - nu[sub theta](theta)'i[sup -1, sub theta theta](theta)nu[sub theta](theta)
under the null model. It then follows from Theorem 1 that p[sub plug](X) and p[sub post](X) calculated using t[sup ~](X) are asymptotic frequentist p values with limiting distribution under f(x; k[sub n]/square root of n, theta) equal to that of p[sub ppost](X) and p[sub cpred](X), because NC(theta) is the same for t(X) and for t[sup ~](X). But because in general t[sup ~](X) is not asymptotically pivotal [as its asymptotic variance c[sup 2](theta) depends on theta], p[sub plug] and p[sub post] often will be calculated by simulation using the fact that, for example,
p[sub plug] is approximately K[sup -1]SIGMA[sup K, sub k=1]I{t[sup ~](X[sup (k)]) > t[sup ~](x[sub obs])},
where X[sup (k)] = (X[sup (k), sub 1], ..., X[sup (k), sub n]) are K independent draws from f(x; theta[sub obs]), and t[sup ~](x[sub obs]) = t(x[sub obs]) - nu[sub n](theta[sub obs]). A potential drawback of this approach is that to evaluate t[sup ~](X[sup (k)]) = t(X[sup (k)]) - nu[sub n](theta[sup (k)]), the maximizer theta[sup (k)] of f(X[sup (k)]; theta) must be recomputed for each simulated dataset X[sup (k)]. The computational difficulties could be overcome by substituting for theta[sup (k)] a single Newton-step estimator starting from the original MLE theta. Similarly, in the case of p[sub post], the posterior distribution of theta must be recomputed for each dataset X[sup (k)], which also may be computationally impractical. Again, the computational difficulties could be overcome by substituting an easy-to-compute normal approximation to the posterior.
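The recipe above can be sketched for Example B (our illustration, with t(X) = Z[sub q], nu[sub n](theta) = mu + sigma z[sub q], Z[sub q] taken as the ceil(nq)-th order statistic, and the MLE recomputed for every simulated data set, exactly as the text describes):

```python
import numpy as np

Z_05 = -1.6448536269514722             # standard normal 5% quantile

def t_tilde(data, k):
    # Centered statistic: Z_q minus nu_n evaluated at the data's own MLE.
    mu_hat, sig_hat = data.mean(), data.std()
    return np.sort(data)[k - 1] - (mu_hat + sig_hat * Z_05)

def recentered_plugin_p(x, q=0.05, K=2000, seed=6):
    rng = np.random.default_rng(seed)
    n = len(x)
    k = int(np.ceil(q * n))
    t_obs = t_tilde(x, k)
    mu0, sig0 = x.mean(), x.std()      # MLE from the observed data
    boot = np.array([t_tilde(rng.normal(mu0, sig0, n), k) for _ in range(K)])
    return float(np.mean(boot > t_obs))

x = np.random.default_rng(7).normal(10.0, 2.0, size=66)
print(recentered_plugin_p(x))
```

Because the statistic is recentered at each replicate's own MLE, the resulting p values are approximately uniform under the null rather than conservative.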
To avoid having to recompute (either exactly or approximately) the MLE or the posterior density of theta for each simulated dataset, two additional approaches may be considered, both of which give p values that are asymptotically equivalent to p[sub ppost](X) under both the null model and local alternatives. The first approach is to replace t[sup ~](X) by the asymptotically pivotal N(0, 1) random variable t[sup ~](X)/c(theta). We then obtain an asymptotic frequentist p value p[sub pivot] based on this statistic by using m(t) = N(0, 1) in (1); specifically, p[sub pivot] = 1 - PHI[t[sup ~](x[sub obs])/c(theta[sub obs])]. The second alternative is to calculate a discrepancy p value based on the discrepancy t(X) - nu[sub n](theta) - nu[sub theta](theta)'i[sup -1, sub theta theta](theta)n[sup -1]S[sub theta](theta).
Indeed, given any discrepancy t(X, theta) with E[sub theta][t(X, theta)] = 0, the discrepancy p value p[sub dis] based on the modified discrepancy
t[sup ~](X, theta) = t(X, theta) - nu[sub theta](theta)'i[sup -1, sub theta theta](theta)n[sup -1]S[sub theta](theta),
is, by Theorem 3, uncorrelated with S[sub theta](theta) and thus an asymptotic frequentist p value with nu[sub theta](theta) as defined in Theorem 2. As emphasized by Meng (1994), neither the MLE nor the posterior distribution of theta needs to be recomputed when calculating p[sub dis] by simulation.
A drawback of these latter approaches is that they can require computation or estimation of sigma[sup 2](theta), nu[sub n](theta), nu[sub theta](theta), and/or i[sub theta theta](theta), which may be computationally difficult. But an important advantage of the last approach is that if we take t(X, theta) equal to n[sup -1]S[sub psi](theta), then t[sup ~](X, theta) becomes the "efficient score" discrepancy
(8) t[sup ~](X, theta) = n[sup -1]S[sub psi](theta) - i[sub psi theta](theta)'i[sup -1, sub theta theta](theta)n[sup -1]S[sub theta](theta)
and the test chi[sub dis] based on p[sub dis] is a locally most powerful asymptotic alpha-level test of psi = 0.
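In Example A the efficient score discrepancy (8) reduces (by the quantities listed in Section 3.4) to (n theta[sub 2])[sup -1]SIGMA[sub i](X[sub i] - gamma v[sub i])(w[sub i] - rho v[sub i]), so the defining property, zero correlation with the nuisance score, can be checked directly. A simulation sketch (ours; the variable names are illustrative):

```python
import numpy as np
from math import sqrt

rng = np.random.default_rng(8)
n, rho, c2, M = 200, 0.9, 1.0, 2000

# Regressors with n^-1 v'v = n^-1 w'w = 1 and n^-1 w'v = rho exactly.
a, b = rng.standard_normal(n), rng.standard_normal(n)
q1 = a / np.linalg.norm(a)
b = b - (q1 @ b) * q1
q2 = b / np.linalg.norm(b)
v = sqrt(n) * q1
w = sqrt(n) * (rho * q1 + sqrt(1.0 - rho**2) * q2)

eps = rng.standard_normal((M, n)) * sqrt(c2)   # null residuals X_i - gamma v_i
raw = eps @ w / (n * c2)                       # unmodified discrepancy n^-1 S_psi
eff = eps @ (w - rho * v) / (n * c2)           # efficient score discrepancy (8)
s_gam = eps @ v / (n * c2)                     # nuisance score n^-1 S_gamma

c_raw = np.corrcoef(raw, s_gam)[0, 1]          # close to rho = .9
c_eff = np.corrcoef(eff, s_gam)[0, 1]          # close to 0
print(c_raw, c_eff)
```

Projecting out the nuisance score removes the correlation exactly here, because (w - rho v)'v = 0 by construction.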
4.2 Adjusted p Values
Another class of alternatives to p[sub cpred] or p[sub ppost] are the adjusted (i.e., calibrated) p values p[sub post,adj], p[sub plug,adj], and p[sub cplug,adj], where for any candidate p value p(X) with observed value p = p(x[sub obs]),
p[sub adj] is equivalent to p[sub adj](x[sub obs]) = F[sub p(X)][p; theta[sub obs]],
where F[sub p(X)](u; theta) is the cdf of p(X) when X is similar to f(x; theta) (Davison and Hinkley 1997, p. 132). Beran (1987) introduced adjusted p values in a bootstrap context as a means of calibrating asymptotically uniform p values so that they become second-order correct, whereas we use them here to render asymptotically nonuniform candidate p values first-order correct. When estimated by simulation, p[sub plug,adj] is precisely the "double parametric bootstrap" p value of Beran (1987) and Davison and Hinkley (1997, p. 177); the computational burden of such simulations can perhaps be alleviated by recycling (Newton and Geyer 1994). A double bootstrap simulation may be avoided because the representation p(X) = 1 - PHI(Q) + o[sub p](1), under H[sub 0], of Theorem 1 implies that p[sub adj,anal] = 1 - PHI[tau[sup -1](theta)z[sub 1-p]] is a simple analytic approximation to p[sub adj]. It is easy to show that for any of our candidate p values, both p[sub adj](X) and p[sub adj,anal](X) have the same limiting distribution as p[sub ppost](X) and p[sub cpred](X) under f(x; k[sub n]/square root of n, theta).
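A simulation estimate of p[sub plug,adj] can be sketched as follows (our illustration, again for the quantile statistic of Example B; here the inner bootstrap level is available in closed form through the binomial cdf of the order statistic, which keeps only one level of simulation):

```python
import numpy as np
from math import erf, sqrt, comb, ceil

def norm_cdf(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def binom_cdf(k, n, p):
    # P(Bin(n, p) <= k), summed directly.
    return sum(comb(n, j) * p**j * (1.0 - p)**(n - j) for j in range(k + 1))

def plugin_p(x, q=0.05):
    # Plug-in p value for T = the ceil(nq)-th order statistic under N(mu, sigma^2):
    # Pr(T_new > t_obs; theta_hat) = P(Bin(n, F_hat(t_obs)) <= k - 1).
    n = len(x)
    k = ceil(q * n)
    t_obs = np.sort(x)[k - 1]
    mu, sig = x.mean(), x.std()
    return binom_cdf(k - 1, n, norm_cdf((t_obs - mu) / sig))

def adjusted_p(x, M=500, seed=9):
    # p_adj = F_{p(X)}(p_obs; theta_hat): simulate M data sets from f(x; theta_hat)
    # and recompute the plug-in p value for each simulated data set.
    rng = np.random.default_rng(seed)
    p_obs = plugin_p(x)
    mu, sig, n = x.mean(), x.std(), len(x)
    ps = np.array([plugin_p(rng.normal(mu, sig, n)) for _ in range(M)])
    return float(np.mean(ps <= p_obs))

x = np.random.default_rng(10).normal(0.0, 1.0, size=100)
print(plugin_p(x), adjusted_p(x))
```

When the plug-in p value itself must be estimated by simulation, the outer loop above acquires an inner loop, which is the double parametric bootstrap referred to in the text.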
Given a candidate p value defined via (1) with reference density m(t|x[sub obs]) equivalent to m[sub T](t|x[sub obs]), define the adjusted reference density evaluated at t[sub obs] to be
m[sub adj](t[sub obs]|x[sub obs]) = f[sub p(X)][p; theta[sub obs]]m(t[sub obs]|x[sub obs]),
where f[sub p(X)](u; theta) = differential F[sub p(X)](u; theta)/differential u and p = p(t[sub obs]). Then p[sub adj] is obtained via (1) with m(t) = m[sub adj](t|x[sub obs]). These densities m[sub adj](t|x[sub obs]) are additional solutions to the frequentist math problem of Section 1.3.
To summarize this section, we have proposed alternative candidate p values that have the same limiting distribution as p[sub ppost](X) and p[sub cpred](X) under the null model and local alternatives. Thus the corresponding nominal alpha-level tests chi(alpha) have the same asymptotic power as the tests chi[sub ppost](alpha) and chi[sub cpred](alpha) based on t(X); that is, ARP(alpha, beta, theta) = beta and ARE(alpha, beta, theta) = 1. The choice as to which p value to use in practice will depend both on the relative ease with which each can be calculated and on their second-order asymptotic and small-sample nonasymptotic distributional properties. These topics are beyond the scope of this article, although example 2.1 of Berger and Bayarri (2000) suggests that p[sub ppost] and p[sub cpred] will be preferred to p[sub post] and p[sub plug] in small samples, even when the mean of t(X) does not depend on theta. When one has a specific alternative model in mind, a major advantage of the discrepancy p value based on the efficient score discrepancy (8) is that it is guaranteed to be locally most powerful whatever the value of theta generating the data.
The first theorem in this section, Theorem 3, derives the asymptotic expansion
(9) p(X) = 1 - PHI(Q) + o[sub p](1)
for a particular random variable Q. Theorems 3 and 4 and Corollary 3 allow us to deduce that Q has the N(k mu(theta), tau[sup 2](theta)) distribution specified in Theorems 1 and 2. All of our candidate p values, including p[sub dis] but excluding p[sub cpred], can be written as
(10) p = p(x[sub obs]) = Integral of[sub THETA]Pr[t(X, theta) > t(x[sub obs], theta); theta]pi(d theta|x[sub obs]).
By taking t(X, theta) = t(X), we obtain the nondiscrepancy p values. Here pi(d theta|x[sub obs]) = pi(theta|x[sub obs])d theta is given in Table 1 for p[sub ppost] and p[sub post]. We take pi[sub dis](.|x[sub obs]) = pi[sub post](.|x[sub obs]). For p[sub plug] and p[sub cplug], we take pi(d theta|x[sub obs]) to be the degenerate distribution that places all of its mass on theta[sub obs] and theta[sub cMLE,obs], respectively. When we take pi(.|x[sub obs]) in (10) to be pi[sub cpred](.|x[sub obs]), we obtain a new p value, the approximate conditional predictive p value, p[sub acpred], that we show in Lemma 2 is asymptotically equivalent to p[sub cpred]. Hence it will suffice to prove Theorem 1 for p[sub acpred] in lieu of p[sub cpred]. It will be useful to have an expression for the random p value p(X); specifically,
(11) p(X) =Integral of[sub THETA]Pr[t(X[sup new], theta) > t(X, theta) |X; theta]pi(d theta|X),
where X[sup new] is drawn from f(x; theta) independently of X.
We prove Theorems 1 and 2 together. The asymptotic normality of t(X, theta) remains a basic assumption. Now if t(X, theta) is allowed to depend on an additional parameter theta, then it is natural to allow the asymptotic mean and variance of t(X, theta[sup *]) under theta to also depend on theta[sup *]. Thus we assume (3), but with nu[sub n](psi, theta) replaced by nu[sub n](psi, theta: theta[sup *]) and sigma[sup 2](theta) replaced by sigma[sup 2](theta: theta[sup *]). These quantities are the asymptotic mean and variance of t(X, theta[sup *]) under theta.
Actually, because the measures pi(d theta|X) concentrate on small (shrinking) neighborhoods of theta[sub 0], the dependence of t(X, theta) on theta does not play a major role, as soon as we assume some natural continuity in this parameter. Specifically, we assume that for every random sequence theta[sub n] = theta[sub 0] + O[sub p](n[sup -1/2]),
(12) n[sup 1/2][{t(X, theta[sub n]) - nu[sub n](0, theta[sub 0]: theta[sub n])} - {t(X, theta[sub 0]) - nu[sub n](0, theta[sub 0]: theta[sub 0])}] = o[sub p](1) under f(x; theta[sub 0]),
which we take as trivially true when t(X, theta) = t(X). Further, we assume that for some p-vector-valued function theta[sup A](X) on the sample space and some p x p matrix SIGMA(theta[sub 0]),
(13) ||pi(n[sup 1/2](theta - theta[sup A](X))|X) - N(0, SIGMA(theta[sub 0]))|| = o[sub P[sub theta[sub 0]]](1)
and
(14) n[sup 1/2](theta[sup A](X) - theta[sub 0]) = O[sub P[sub theta[sub 0]]](1).
Here ||.|| is the total variation distance between two distributions P and Q: ||P - Q|| = 2 sup[sub B]|P(B) - Q(B)|, with the supremum taken over all Borel sets in R[sup p] intersection THETA. Let sigma(theta[sub 0]) = sigma(theta[sub 0]: theta[sub 0]), nu[sub theta](theta[sub 0]) = lim[sub n arrow right infinity]differential nu[sub n](0, theta: theta[sub 0])/differential theta[sub |theta=theta[sub 0]], and nu[sub psi](theta[sub 0]) = lim[sub n arrow right infinity]differential nu[sub n](psi, theta[sub 0]: theta[sub 0])/differential psi[sub |psi=0]. Then, under conditions discussed later, (13) and (14) hold for the choices of theta[sup A](X) and SIGMA(theta[sub 0]) specified in Table 2.
Theorem 3. Suppose that (12)-(14) and (3) hold with t(X, theta) and nu[sub n](0, theta: theta) replacing t(X) and nu[sub n](0, theta). Then, under X is similar to f(x; theta[sub 0]),
(15) p(X) = 1 - PHI(Q) + o[sub p](1),
where
(16) Q = [n[sup 1/2]{t(X, theta[sub 0]) - nu[sub n](0, theta[sub 0]: theta[sub 0])} - nu[sub theta](theta[sub 0])'n[sup 1/2](theta[sup A](X) - theta[sub 0])]/{sigma[sup 2](theta[sub 0]) + nu[sub theta](theta[sub 0])'SIGMA(theta[sub 0])nu[sub theta](theta[sub 0])}[sup 1/2].
As this theorem is our main result, we give an informal proof that emphasizes the main idea. We give a formal proof in Appendix A.
Informal Proof of Theorem 3. Note that t(X[sup new], theta) > t(X, theta) is algebraically equivalent to
(17) n[sup 1/2]{t(X[sup new], theta) - nu[sub n](0, theta: theta)} > n[sup 1/2]{t(X, theta) - nu[sub n](0, theta: theta)}.
Now, by (13) and (14), we can asymptotically ignore all theta that are not within O(n[sup -1/2]) of theta[sub 0]. Hence by (12), the left side of (17) is approximately
n[sup 1/2]{t(X[sup new], theta[sub 0]) - nu[sub n](0, theta[sub 0]: theta[sub 0])}
- n[sup 1/2]{nu[sub n](0, theta: theta[sub 0]) - nu[sub n](0, theta[sub 0]: theta[sub 0])}
= n[sup 1/2]{t(X[sup new], theta[sub 0]) - nu[sub n](0, theta: theta[sub 0])}.
By (12) and the differentiability assumption in (3b), the right side of (17) is approximately
(18) Square root of n{t(X, theta[sub 0]) - nu[sub n](0, theta[sub 0]: theta[sub 0])} - nu[sub theta](theta[sub 0])' Square root of n(theta -theta[sub 0]).
Hence the event t(X[sup new], theta) > t(X, theta) is approximately equivalent to the event
(19) n[sup 1/2]{t(X[sup new], theta[sub 0]) - nu[sub n](0, theta: theta[sub 0])} + nu[sub theta](theta[sub 0])'n[sup 1/2](theta - theta[sub 0]) > n[sup 1/2]{t(X, theta[sub 0]) - nu[sub n](0, theta[sub 0]: theta[sub 0])}.
Now conditional on (X, theta), by (3a), the first term on the left side of (19) is approximately N(0, sigma[sup 2](theta)), and, given X, by (13), the second term on the left side of (19) is converging to a N(nu[sub theta](theta[sub 0])'n[sup 1/2](theta[sup A](X) - theta[sub 0]), nu[sub theta](theta[sub 0])'SIGMA(theta[sub 0])nu[sub theta](theta[sub 0])) distribution. Further, because the first term has mean 0 conditional on (X, theta), the two terms are conditionally uncorrelated given X. Hence, given X, the left side of (19) is asymptotically
(20) N(nu[sub theta](theta[sub 0])' square root of n(theta[sup A](X) -theta[sub 0]), sigma[sup 2](theta[sub 0])
+ nu[sub theta](theta[sub 0])'SIGMA(theta[sub 0])nu[sub theta](theta[sub 0])).
It follows that the conditional probability, given X, of the event t(X[sup new], theta) > t(X, theta) is approximately 1 - PHI(Q) with Q as in (16), which concludes the proof.
Under the assumption that the distributions of X under (psi[sub n], theta[sub 0]) and (0, theta[sub 0]) are contiguous (which will quite generally be the case), the expansion of Theorem 3 is also valid under X is similar to f(x; psi[sub n], theta[sub 0]). It remains to show that the distribution of Q in (16) converges to the N(k mu(theta[sub 0]), tau[sup 2](theta[sub 0])) distribution given in Theorems 1 and 2. For this, we need the joint distribution of t(X, theta[sub 0]) and theta[sup A](X).
In all of the examples that we consider, we will have for given "influence functions" B[sub i](theta[sub 0]: theta) that for all theta in a neighborhood of theta[sub 0], with X is similar to f(x; theta[sub 0]),
(21) n[sup 1/2]{t(X, theta) - nu[sub n](0, theta[sub 0]: theta)} = n[sup -1/2]SIGMA[sub i]B[sub i](theta[sub 0]: theta) + o[sub p](1)
for some mean-0 B[sub i](theta[sub 0]: theta) = b[sub i](X[sub i], theta[sub 0]: theta),
(22) n[sup 1/2](theta - theta[sub 0]) = i[sup -1, sub theta theta](theta[sub 0])n[sup -1/2] S[sub theta](theta[sub 0]) + o[sub p](1),
(23) n[sup 1/2](theta[sub cMLE] - theta[sub 0]) = i[sub c,theta theta](theta[sub 0])[sup -1]{n[sup -1/2]S[sub c theta](theta[sub 0])} + o[sub p](1)
where
(24) S[sub c theta](theta[sub 0]) equivalent to S[sub theta](theta[sub 0]) - nu[sub theta](theta[sub 0])sigma[sup -2](theta[sub 0])n[sup 1/2]T[sub std](theta[sub 0]),
T[sub std](theta[sub 0]) = n[sup 1/2]{t(X) - nu[sub n](0, theta[sub 0]: theta[sub 0])},
and i[sub c, theta theta](theta[sub 0]) is defined in Table 2. Equation (21) is the usual asymptotically linear expansion of an asymptotically normal statistic, showing that it behaves like a sample average, and (22) is the usual expansion of the MLE. Equation (23) is a conditional version of (22), which we discuss further later. Given the foregoing expansions, the joint limit distribution of t(X, theta) and theta[sup A] under the null hypothesis (psi[sub n] = 0, theta[sub 0]) follows immediately from the multivariate central limit theorem (CLT) (where we need to assume the Lindeberg-Feller conditions to take care of the possible non-iid character of the terms in the sums). Note that the right sides of (22) and (23) are also sums. The expansions (22) and (23) imply that the asymptotic variance of the MLE and conditional MLE are i[sup -1, sub theta theta](theta[sub 0]) and i[sub c, theta theta](theta[sub 0])[sup -1]. To obtain the limit distribution under alternatives (psi[sub n], theta[sub 0]), we make the further assumption that as n arrow right infinity, for (k[sub n], k[sup *, sub n]) arrow right (k', k[sup *]) Is equivalent to h,
(25) log f(X; k[sub n]/square root of n, theta[sub 0] + k[sup *, sub n]/square root of n)/f(X; theta[sub 0]) = h'n[sup -1/2](S[sub psi](theta[sub 0]), S[sub theta](theta[sub 0])')' - 1/2 h'i(theta[sub 0])h + o[sub P](1)
and
(26) n[sup -1/2](S[sub psi](theta[sub 0]), S[sub theta](theta[sub 0])')' arrow right N(0, i(theta[sub 0])) in distribution under f(x; theta[sub 0]),
where
i(theta[sub 0]) is the information matrix, with first row (i[sub psi psi](theta[sub 0]), i[sub psi theta](theta[sub 0])') and second row (i[sub psi theta](theta[sub 0]), i[sub theta theta](theta[sub 0])),
and we assume the sum on the right side of (21) satisfies the Lindeberg condition. Equations (25)-(26) imply that the model f(x; psi, theta) is locally asymptotically normal (LAN) at (0, theta[sub 0]). Therefore, we can apply LeCam's third lemma to obtain the desired result (van der Vaart 1998). Specifically, we obtain the following theorem.
Theorem 4. Given t(X, theta) and model (2a)-(2b), suppose that both (21)-(26) and the assumptions of Theorem 3 hold. Then the following obtain:
(27)
a. Under (psi = 0, theta[sub 0]), acov(T[sub std](theta[sub 0]), n[sup -1/2]S[sub psi](theta[sub 0])) = nu[sub psi](theta[sub 0]) and acov(T[sub std](theta[sub 0]), n[sup -1/2]S[sub theta](theta[sub 0])) = nu[sub theta](theta[sub 0]),
where acov(S[sub 1], S[sub 2]) denotes the asymptotic covariance of S[sub 1] and S[sub 2] under (psi = 0, theta[sub 0]).
b. When X Is similar to f(x; k[sub n]/square root of n, theta[sub 0]),(T[sub std](theta[sub 0]) nu[sub theta] (theta[sub 0])'n[sup 1/2](theta - theta[sub 0]) converges to a normal distribution with mean (Multiple lines cannot be converted in ASCII text) (Multiple lines cannot be converted in ASCII text) and covariance matrix
(28) the 2 x 2 matrix with rows (sigma[sup 2](theta[sub 0]), OMEGA(theta[sub 0])) and (OMEGA(theta[sub 0]), OMEGA(theta[sub 0])), where OMEGA(theta[sub 0]) = nu[sub theta](theta[sub 0])'i[sup -1, sub theta theta](theta[sub 0])nu[sub theta](theta[sub 0]),
and thus T[sub std](theta[sub 0]) - nu[sub theta](theta[sub 0])'n[sup 1/2]{theta - theta[sub 0]} converges to a N(k omega(theta[sub 0]), sigma[sup 2](theta[sub 0]) - OMEGA(theta[sub 0])) distribution. Further, (T[sub std](theta[sub 0]), nu[sub theta](theta[sub 0])'n[sup 1/2](theta[sub cMLE] - theta[sub 0])) converges to a normal distribution with mean k(nu[sub psi](theta[sub 0]), nu[sub theta](theta[sub 0])'i[sup -1, sub c,theta theta](theta[sub 0])i[sub c,psi theta](theta[sub 0])) and covariance matrix
(29) the 2 x 2 matrix with rows (sigma[sup 2](theta[sub 0]), 0) and (0, omega[sub c](theta[sub 0])),
where omega[sub c](theta[sub 0]) = nu[sub theta](theta[sub 0])'i[sub c,theta theta](theta[sub 0])[sup -1]nu[sub theta](theta[sub 0]) and i[sub c,psi theta](theta[sub 0]) = i[sub psi theta](theta[sub 0]) - nu[sub theta](theta[sub 0])nu[sub psi](theta[sub 0])sigma[sup -2](theta[sub 0]). Thus T[sub std](theta[sub 0]) - nu[sub theta](theta[sub 0])'n[sup 1/2]{theta[sub cMLE] - theta[sub 0]} converges to a N(k omega[sub c](theta[sub 0]), sigma[sup 2](theta[sub 0]) + omega[sub c](theta[sub 0])) distribution.
Remark 6. A critical observation required in applying LeCam's third lemma to obtain the results in Theorem 4(b) is that 0 = acov(T[sub std](theta[sub 0]), n[sup -1/2]S[sub c theta](theta[sub 0])), which is a consequence of the fact that, by (27), n[sup -1/2]S[sub c theta](theta[sub 0]) is the residual from the asymptotic least squares projection of n[sup -1/2]S[sub theta](theta[sub 0]) on the normalized test statistic T[sub std](theta[sub 0]), because
acov(T[sub std](theta[sub 0]), n[sup -1/2]S[sub c theta](theta[sub 0])) = nu[sub theta](theta[sub 0]) - nu[sub theta](theta[sub 0])sigma[sup -2](theta[sub 0])sigma[sup 2](theta[sub 0]) = 0.
As shown in Corollary 3, Theorems 1 and 2 follow from Theorems 3 and 4, provided that we can establish that (13) and (14) hold for the entries in Table 2. The first row of the table merely asserts asymptotic normality of the MLE and hence is valid under the usual conditions. The second row is the assertion of the Bernstein-von Mises theorem and hence is valid under even weaker conditions. Primitive conditions to ensure the validity of the last three rows of Table 2 are less easily available. We do not provide such a set of conditions, but rather offer below an informal argument as to why these rows are expected to be correct.
Corollary 3. Under the assumptions of Theorem 4, if (13) and (14) hold for the entries in Table 2, then, with p[sub acpred] substituted for p[sub cpred] the p values considered in Theorems 1 and 2 have the asymptotic expansions given therein.
Proof of Corollary 3. For the plug-in, posterior predictive, and discrepancy p values, the proof is immediate from Theorems 3 and 4. [Note that if the off-diagonal entries in (28) were 0 rather than OMEGA(theta[sub 0]), then, even if the variance of nu[sub theta](theta[sub 0])'n[sup 1/2](theta - theta[sub 0]) had remained nonzero, Q for the posterior predictive and discrepancy p values would have had variance 1, and the associated p values would not be conservative. But the covariance OMEGA(theta[sub 0]) is in fact nonzero whenever nu[sub theta](theta[sub 0]) is nonzero.] Furthermore, it is immediate from Theorems 3 and 4 that expansion (16) holds with Q is similar to N(k omega[sub c](theta[sub 0])/{sigma[sup 2](theta[sub 0]) + omega[sub c](theta[sub 0])}[sup 1/2], 1) for p[sub ppost](X) and p[sub acpred](X) and Q is similar to N(k omega[sub c](theta[sub 0])/sigma(theta[sub 0]), {sigma[sup 2](theta[sub 0]) + omega[sub c](theta[sub 0])}/sigma[sup 2](theta[sub 0])) for p[sub cplug](X). But some algebra shows that {sigma[sup 2](theta[sub 0]) + omega[sub c](theta[sub 0])}/sigma[sup 2](theta[sub 0]) = sigma[sup 2](theta[sub 0])/{sigma[sup 2](theta[sub 0]) - OMEGA(theta[sub 0])} and omega[sub c](theta[sub 0])/{sigma[sup 2](theta[sub 0]) + omega[sub c](theta[sub 0])}[sup 1/2] = NC(theta[sub 0]), which proves the corollary.
To complete the proof of Theorem 1, it only remains to show that p[sub cpred](X) and p[sub acpred](X) have the same limiting distribution. The key observation is that, as discussed earlier, theta[sub cMLE] and t(X) are asymptotically uncorrelated, so that in large samples the conditional distribution given theta[sub cMLE] and the unconditional distribution of t(X) are the same. Formally, we have the following lemma, whose proof is similar to aspects of the proof of Theorem 3 given in Appendix A and thus is omitted.
Lemma 2. If for every c,
[Multiple line equation(s) cannot be represented in ASCII text]
then p[sub cpred](X) and p[sub acpred](X) have the same limiting distribution under f(x; k[sub n]/square root of n, theta[sub 0]).
The supposition of Lemma 2 would need to be checked on a case-by-case basis, as general regularity conditions for it are not known.
Conditional Inference. If our given statistic T = t(X) satisfies (3), then we would expect the marginal model f[sub T](t; psi, theta) = f(t; psi, theta) to be LAN. That is, under T is similar to f[sub T](t; 0, theta[sub 0]),
(30) log f[sub T](T; k[sub n]/square root of n, theta[sub 0] + k[sup *, sub n]/square root of n)/f[sub T](T; 0, theta[sub 0]) = sigma[sup -2](theta[sub 0]){k[sub n]nu[sub psi](theta[sub 0]) + k[sup *, sub n]'nu[sub theta](theta[sub 0])}T[sub std](theta[sub 0]) - 1/2 sigma[sup -2](theta[sub 0]){k[sub n]nu[sub psi](theta[sub 0]) + k[sup *, sub n]'nu[sub theta](theta[sub 0])}[sup 2] + o[sub P](1).
Together with the similar expansion (25) for the unconditional model for X, we obtain, on noting f(X) = f(X|T)f(T), that
(31) log f(X|T; k[sub n]/square root of n, theta[sub 0] + k[sup *, sub n]/square root of n)/ f(X|T; 0, theta[sub 0]) = h' n[sup -1/2] S[sub c](theta[sub 0]) - 1/2 h' i[sub c] (theta[sub 0]) h + o[sub P](1),
where
S[sub c](theta[sub 0]) is equivalent to (S[sub c psi](theta[sub 0]), S[sub c theta](theta[sub 0])')', with S[sub c psi](theta[sub 0]) = S[sub psi](theta[sub 0]) - nu[sub psi](theta[sub 0])sigma[sup -2](theta[sub 0])n[sup 1/2]T[sub std](theta[sub 0]),
and
i[sub c](theta[sub 0]) = i(theta[sub 0]) - nu(theta[sub 0])nu(theta[sub 0])'sigma[sup -2](theta[sub 0]), with nu(theta[sub 0]) = (nu[sub psi](theta[sub 0]), nu[sub theta](theta[sub 0])')',
is the asymptotic covariance matrix of n[sup -1/2] S[sub c](theta[sub 0]). The vector S[sub c](theta[sub 0]) is referred to as the conditional score because it is the linear term in the expansion of the conditional density, and i[sub c](theta[sub 0]) is the conditional information matrix. As noted earlier, n[sup -1/2]S[sub c](theta[sub 0]) is the residual from the asymptotic least squares projection of n[sup -1/2]S(theta[sub 0]) on the normalized test statistic T[sub std](theta[sub 0]).
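For reference, the projection just described can be written in display form (our transcription into standard notation, consistent with the componentwise formula for S[sub c psi] above; here nu(theta[sub 0]) stacks nu[sub psi](theta[sub 0]) and nu[sub theta](theta[sub 0])):

```latex
% Conditional score = residual of the asymptotic least squares projection
% of the score on the standardized test statistic
S_c(\theta_0)
  = S(\theta_0)
    - \nu(\theta_0)\,\sigma^{-2}(\theta_0)\, n^{1/2} T_{\mathrm{std}}(\theta_0),
\qquad
\nu(\theta_0) = \bigl(\nu_\psi(\theta_0),\ \nu_\theta(\theta_0)'\bigr)'.
```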
Note that if n[sup 1/2]{T - nu[sub n](k[sub n]/square root of n, theta[sub 0] + k[sup *, sub n]/square root of n)} were exactly distributed N(0, sigma[sup 2](theta[sub 0])) under f[sub T](t; k[sub n]/square root of n, theta[sub 0] + k[sup *, sub n]/square root of n), with nu[sub n](k[sub n]/square root of n, theta[sub 0] + k[sup *, sub n]/square root of n) = nu[sub n](0, theta[sub 0]) + nu[sub psi](theta[sub 0]) k[sub n]/square root of n + nu[sub theta](theta[sub 0]) k[sup *, sub n]/square root of n, then (30) would be exactly true without the o[sub p](1) term. But establishing (30) for general asymptotically normal statistics T requires additional regularity conditions, which we discuss in Appendix B. For example, we show that if T = n[sup -1] SIGMA[sub i] d(X[sub i]) and the X[sub i] are iid, then (30) holds if d(X[sub i]) either has an absolutely continuous component or is discrete with finite support.
The expansion (31) is the basis for deriving the asymptotic distribution of the conditional MLE and the validity of the last three rows of Table 2. First, the expansion suggests that the conditional MLE theta[sub cMLE] maximizing f(X|T; theta) satisfies (23). The expansion (23) is similar to the expansion (22) for the unconditional MLE, but with the conditional score and information substituted for the unconditional ones. Second, we may expect a conditional Bernstein-von Mises theorem to hold. Basically, what is lacking for a full proof of these results is a proof of square root of n consistency of theta[sub cMLE] and square root of n consistency of the conditional posterior. These are not trivial matters, but they are of a technical nature and do not add to our knowledge of the form of the limits. This form is determined by the expansion (31) only. We content ourselves with providing in Appendix B exact conditions for the validity of the structural expansion (31) and sketching in Appendix C a direct proof for Example B of Section 3.
Proof of Lemma 1. We only need to prove the lemma in the special case where t(X) = n[sup -1]S[sub psi](theta[sup *]) because, from its definition, NC(theta[sup *]) will be the same for a given statistic t[sub 1](X) and all affine transformations of t[sub 1](X),t(X) = at[sub 1](X) + b + o[sub p](1), with a Is not equal to 0. Now, by Theorem 4, for t(X) = n[sup -1] S[sub psi](theta[sup *]),
nu[sub psi](theta[sup *]) = i[sub psi psi](theta[sup *])
and
nu[sub theta](theta[sup *]) = i[sub psi theta] (theta[sup *])
which proves the lemma.
Legend for Chart (A - Method; B - Reference density):

Plug-in (p[sub plug]) -- m[sub plug](x|x[sub obs]) = f(x; theta[sub obs])
Prior predictive (p[sub prior]) -- m[sub prior](x) = Integral of f(x; theta) pi(theta) d theta
Posterior predictive (p[sub post]) -- m[sub post](x|x[sub obs]) = Integral of f(x; theta) pi[sub post](theta|x[sub obs]) d theta
Partial posterior predictive (p[sub ppost]) -- m[sub ppost](x|x[sub obs]) = Integral of f(x; theta) pi[sub ppost](theta|x[sub obs]) d theta
Conditional predictive (p[sub cpred]) -- m[sub cpred](x|x[sub obs]) = Integral of f(x|theta[sub cMLE]; theta) pi[sub cpred](theta|x[sub obs]) d theta
Conditional plug-in (p[sub cplug]) -- m[sub cplug](x|x[sub obs]) = f(x; theta[sub cMLE,obs])
Discrepancy (p[sub dis]) -- m[sub dis](x, theta|x[sub obs]) = f(x; theta) pi[sub post](theta|x[sub obs])

NOTE: The data model is f(x; theta), where theta has MLE theta[sub obs]. The prior for theta is pi(theta), and the posterior is pi[sub post](theta|x[sub obs]) proportional to f(x[sub obs]; theta) pi(theta). The posterior in the conditional model f(x|t; theta) is pi[sub ppost](theta|x[sub obs]) proportional to f(x[sub obs]|t[sub obs]; theta) pi(theta), where t = t(x) is the test statistic. The conditional MLE, theta[sub cMLE,obs], is the maximizer of f(x[sub obs]|t[sub obs]; theta). The posterior for theta in the "marginal" model, where only the statistic theta[sub cMLE] is available to the data analyst, is pi[sub cpred](theta|x[sub obs]) Is equivalent to pi[sub cpred](theta|theta[sub cMLE,obs]) proportional to f(theta[sub cMLE,obs]; theta) pi(theta). Here f(theta[sub cMLE,obs]; theta) denotes the marginal density of the random variable theta[sub cMLE] evaluated at its observed value theta[sub cMLE,obs], and f(x|theta[sub cMLE,obs]; theta) denotes the conditional density of x given theta[sub cMLE].
Legend for Chart (A - Method; B - theta[sup A](x); C - SIGMA(theta[sub 0])):

Plug-in -- theta[sup A] -- 0
Posterior predictive and discrepancy -- theta[sup A] -- i[sup -1, sub theta theta](theta[sub 0])
Conditional predictive and approximate conditional predictive -- theta[sub cMLE] -- i[sup -1, sub c, theta theta](theta[sub 0]) Is equivalent to {i[sub theta theta](theta[sub 0]) - sigma[sup -2](theta[sub 0]) nu[sub theta](theta[sub 0])[sup (x)2]}[sup -1]
Partial posterior predictive -- theta[sub cMLE] -- i[sup -1, sub c, theta theta](theta[sub 0])
Conditional plug-in -- theta[sub cMLE] -- 0
GRAPHS: Figure 1. Example A. Asymptotic cdfs of Candidate p Values, for (a) rho = .5, (b) rho = .9, and (c) rho = .99.
GRAPHS: Figure 2. Example A. (a) Actual(alpha, theta), (b) ARP(alpha, beta, theta), and (c) ARE(alpha, beta, theta) as Functions of rho for alpha = 5% and beta = 80%. The vertical axes of (a) and (c) were truncated for quality of display; note that the actual level of p[sub cplug](X) arrow right 1/2 as rho arrow right 1, and its ARE arrow right infinity.
GRAPHS: Figure 3. Example B. (a) Actual(alpha, theta) for alpha = 5% as a Function of q, (b) ARP(alpha, beta, theta) and (c) ARE(alpha, beta, theta), for alpha = 5% and beta = 80%. The vertical axes of (a) and (c) were truncated for quality of display.
GRAPHS: Figure 4. Example C. (a) Actual(alpha, theta), (b) ARP(alpha, beta, theta), and (c) ARE(alpha, beta, theta) as Functions of q for alpha = 5% and beta = 80%. The vertical axes of (a) and (c) were truncated for quality of display.
Bayarri, M. J., and Berger, J. O. (1999), "Quantifying Surprise in the Data and Model Verification," in Bayesian Statistics 6, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, Oxford, U.K.: Oxford University Press, pp. 53-67.
----- (2000), "P Values in Composite Null Models," Journal of the American Statistical Association, 95, 1127-1142.
Beran, R. J. (1988), "Pre-Pivoting Test Statistics: A Bootstrap View of Asymptotic Refinements," Journal of the American Statistical Association, 83, 687-697.
Box, G. E. P. (1980), "Sampling and Bayes' Inference in Scientific Modelling and Robustness," Journal of the Royal Statistical Society, Ser. A, 143, 383-430.
Davison, A. C., and Hinkley, D. V. (1997), Bootstrap Methods and Their Application, Cambridge, U.K.: Cambridge University Press.
Evans, M. (1997), "Bayesian Inference Procedures Derived via the Concept of Relative Surprise," Communications in Statistics, 26, 125-143.
Feller, W. (1971), An Introduction to Probability Theory and Its Applications, Vol. 2, New York: Wiley.
Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (1995), Bayesian Data Analysis, London: Chapman & Hall.
Gelman, A., Meng, X. L., and Stern, H. (1996), "Posterior Predictive Assessment of Model Fitness via Realized Discrepancies" (with discussion), Statistica Sinica, 6, 733-807.
Ghosh, J. K. (1994), "Higher-Order Asymptotics," NSF-CBMS Regional Conference Series in Probability and Statistics, Vol. 4, Hayward, CA: Institute of Mathematical Statistics.
Guttman, I. (1967), "The Use of the Concept of a Future Observation in Goodness-of-Fit Problems," Journal of the Royal Statistical Society, Ser. B, 29, 83-100.
LeCam, L. (1986), Asymptotic Methods in Statistical Decision Theory, New York: Springer-Verlag.
LeCam, L., and Yang, G. (1988), "On the Preservation of Local Asymptotic Normality Under Information Loss," The Annals of Statistics, 16, 483-520.
Meng, X. L. (1994), "Posterior Predictive p Values," The Annals of Statistics, 22, 1142-1160.
Newton, M. A., and Geyer, C. J. (1994), "Bootstrap Recycling: A Monte Carlo Alternative to the Nested Bootstrap," Journal of the American Statistical Association, 89, 905-912.
Pauler, D. (1999), Discussion of "Quantifying Surprise in the Data and Model Verification" by M. J. Bayarri and J. O. Berger, in Bayesian Statistics 6, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, Oxford, U.K.: Oxford University Press, pp. 70-73.
Robins, J. M. (1999), Discussion of "Quantifying Surprise in the Data and Model Verification" by M. J. Bayarri and J. O. Berger, in Bayesian Statistics 6, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, Oxford, U.K.: Oxford University Press, pp. 67-70.
Rubin, D. B. (1984), "Bayesianly Justifiable and Relevant Frequency Calculations for the Applied Statistician," The Annals of Statistics, 12, 1151-1172.
Stigler, S. M. (1977), "Do Robust Estimators Work With Real Data?," The Annals of Statistics, 5, 1055-1077.
van der Vaart, A. W. (1998), Asymptotic Statistics, Cambridge, U.K.: Cambridge University Press.
APPENDIX A: PROOF OF THEOREM 3
By (14), we have that with X Is similar to f(x; theta[sub 0]), for all epsilon > 0, there exists a constant c[sub epsilon] such that
(A.1) [Multiple line equation(s) cannot be represented in ASCII text]
where E[sub N(mu, SIGMA)] refers to expectation with respect to a normal distribution with mean mu and variance matrix SIGMA. Equation (A.1) says that, with large probability, X is such that when theta Is similar to N(theta[sup A](X), SIGMA(theta[sub 0])/n), theta lies in the ball of radius c[sub epsilon]/square root of n around theta[sub 0] with high probability.
Now, because the total variation norm ||P - Q|| is also equal to 2 sup[sub f]{|Integral of f dP - Integral of f dQ|: 0 Is less than or equal to f Is less than or equal to 1} and the map theta arrow right Pr(t(X[sup new], theta) Is less than or equal to t(X, theta)|X; theta) is uniformly bounded by 1, we have, by (13), that
(A.2) [Multiple line equation(s) cannot be represented in ASCII text]
where phi(theta; mu, SIGMA) is the density of a N(mu, SIGMA) random variable and, in (A.2), the integrand can be defined in an arbitrary way for theta Is not an element of THETA.
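The total-variation characterization used here can be checked in a small discrete toy case (our own illustration, assuming NumPy): the supremum over functions 0 Is less than or equal to f Is less than or equal to 1 of |Integral of f dP - Integral of f dQ| is attained at an indicator function, and twice it equals the L[sub 1] distance.

```python
import numpy as np
from itertools import product

# Two probability vectors on a three-point sample space
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])

l1 = np.abs(p - q).sum()  # total variation norm ||P - Q|| of the signed measure
# The sup of the linear functional f -> |sum f*(p - q)| over the cube [0, 1]^3
# is attained at a vertex, i.e. at the indicator function of a set
sup_f = max(abs(np.dot(f, p - q)) for f in product([0.0, 1.0], repeat=3))
print(l1, 2 * sup_f)  # the two agree up to rounding
```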
Now if we restrict the integral to the set {theta: ||theta - theta[sub 0]|| < c[sub epsilon]/square root of n}, then it changes at most by (Multiple lines cannot be converted in ASCII text), which, with probability at least 1 - epsilon, is less than epsilon. Let A[sub epsilon] be the event on which this is true, so that (Multiple lines cannot be converted in ASCII text). We can write the integrand as Pr[Eq. (17)|X; theta]. By (12), we have that for theta[sub n] = theta[sub 0] + O(1/square root of n), n[sup 1/2](t(X[sup new], theta[sub n]) - nu[sub n](0, theta[sub 0]: theta[sub n])) - n[sup 1/2](nu[sub n](0, theta[sub n]: theta[sub 0]) - nu[sub n](0, theta[sub 0]: theta[sub 0])) = n[sup 1/2](t(X[sup new], theta[sub 0]) - nu[sub n](0, theta[sub n]: theta[sub 0])) + o[sub p[sub theta[sub 0]]](1). By the contiguity assumption in (3), this is true also for the remainder term o[sub p[sub theta[sub n]]](1). By (3a), 1/sigma(theta[sub 0]) times the right side of the last equality is asymptotically standard normal under theta[sub n]. Thus for every c,
[Multiple line equation(s) cannot be represented in ASCII text]
Next, by (3b) holding at theta[sub 0] and (12), for every theta[sub n] = theta[sub 0] + O[sub P](1/square root of n),
[Multiple line equation(s) cannot be represented in ASCII text]
[Multiple line equation(s) cannot be represented in ASCII text]
where ||phi||[sub infinity] is the maximum of the N(0, 1) density. By combining the two previous displays, we see that
[Multiple line equation(s) cannot be represented in ASCII text]
Now, by combining this display with (A.2), we obtain
[Multiple line equation(s) cannot be represented in ASCII text]
This being true for every epsilon > 0 implies that 1 - p(X) is asymptotically equivalent to
Integral of PHI((n[sup 1/2](t(X, theta[sub 0]) - nu[sub n](0, theta[sub 0]: theta[sub 0]))/sigma(theta[sub 0])) - nu[sub theta](theta[sub 0])' sigma[sup -1](theta[sub 0]) n[sup 1/2](theta - theta[sub 0])) x phi(theta; theta[sup A](X), SIGMA(theta[sub 0])/n) d theta = PHI(Q),
where Q is given by (16).
Lemma B.1. Suppose that (21), (25), and (26) hold for a statistic T = t(X). Furthermore, suppose that under f(x; k/square root of n, theta[sub 0] + k[sup *]/square root of n),
(B.1) square root of n{T - nu[sub n](k/square root of n, theta[sub 0] + k[sup *]/square root of n)}
converges in variation distance to a N(0, sigma[sup 2](theta[sub 0])) distribution for all k is an element of R[sup 1], k[sup *] is an element of R[sup p]. Then (27) holds.
Idea of Proof. The lemma is essentially a consequence of theorem 4 of LeCam and Yang (1988), because in their terminology, our assumptions imply that n[sup 1/2](T- nu[sub n](psi, theta)) is distinguished in local experiments indexed by (psi, theta) = (k/square root of n, theta[sub 0] + k[sup *]/square root of n) with theta[sub 0] known. Details will be presented elsewhere.
If our statistic T equals n[sup -1] SIGMA[sub i] d(X[sub i]), then condition (B.1) is satisfied for h = (k, k[sup *]) = 0 if d(X[sub i]) has a finite second moment, the distribution of d(X[sub i]) has an absolutely continuous component, and the X[sub i] are iid. This follows by theorem XV.5.2 of Feller (1971). For general h, (B.1) will be true if we make these conditions uniform in theta running through a neighborhood of theta[sub 0]. For more general asymptotically normal test statistics T, results such as (B.1) appear to be usually established as part of the derivation of an Edgeworth expansion for the distribution of T. (A discussion, with special attention to curved exponential families, and further references have been given in, for example, Ghosh 1994, chap. 2.) Results of this type are nontrivial. The use of the total variation norm makes (B.1) much more restrictive than convergence in law of n[sup 1/2](T - nu[sub n](k/square root of n, theta[sub 0] + k[sup *]/square root of n)).
Condition (B.1) is certainly stronger than needed. It would be sufficient that the sequence n[sup 1/2](T - nu[sub n](psi, theta)) be distinguished in local experiments consisting of observing T with parameter (psi, theta) = (k/square root of n, theta[sub 0] + k[sup *]/square root of n), with theta[sub 0] being known, and k is an element of R[sup 1] and k[sup *] is an element of R[sup p]. This concept was discussed by LeCam (1986), along with sufficient conditions, but the discussion is involved. We now discuss an important special case in which (B.1) can be relaxed. If T = T[sub n] is lattice distributed with the span of the lattice possibly depending on n, but not on theta, then observing T[sub n] is statistically equivalent to observing a smoothed version of T[sub n], if the smoothing is performed within the intervals generated by the lattice. In this case it can suffice to verify (B.1) for smoothed versions of the law of T[sub n]. We make this precise in the following theorem. We assume that T[sub n] = n[sup -1] SIGMA[sub i] d(X[sub i]) with d(X[sub i]) taking its values in a grid of points ..., a - s, a, a + s, a + 2s, ... for fixed numbers a and s (the span of the lattice). It appears that we can always arrange this without loss of generality. For example, if d(X[sub i]) is finitely discretely distributed, then it certainly suffices that d(X[sub i]) takes finitely many values in the rationals only. We assume that a and s do not depend on theta.
Theorem B.1. Suppose that the X[sub i] are iid and E[sub psi[sub n], theta[sub n]]|d(X[sub i])|[sup 3] = O(1) for every psi[sub n] arrow right 0 and theta[sub n] arrow right theta[sub 0]. Assume that nu(psi, theta) = E[sub psi, theta][d(X[sub i])] is differentiable at (0, theta[sub 0]) and sigma[sup 2](psi, theta) = var[sub psi, theta]{d(X[sub i])} is continuous at (0, theta[sub 0]). Finally, assume that the distribution of n[sup 1/2]t(X) under f(x; psi[sub n], theta[sub n]) converges in law to the distribution of d(X[sub i]) under (psi, theta) = (0, theta[sub 0]). Then (30) holds. The proof will be given elsewhere.
Consistency and Asymptotic Normality of theta[sub cMLE]
For simplicity and without loss of generality, we consider the case where the variance sigma[sup 2] is known and equal to 1, so theta = mu. Thus X[sub 1],...,X[sub n] are iid N(theta, 1). Let Z[sub q] = Z[sub qn] be the nq[sub n]th-order statistic, where q[sub n] arrow right q and nq[sub n] is an integer. Then data from X|Z[sub qn] = x can be generated by generating iid data Y[sub 1],...,Y[sub nqn-1] from the density I(y Is less than or equal to x) phi(y - theta)/PHI(x - theta) and then generating Y[sub nqn+1],...,Y[sub n] iid from the density I(y > x) phi(y - theta)/{1 - PHI(x - theta)}, independently of the previous Y[sub i]. Thus the likelihood function is proportional to
(C.1) [Multiple line equation(s) cannot be represented in ASCII text]
Because the likelihood function is the product of the likelihoods for two iid identified parametric models with common parameter theta, it follows that the maximizer theta[sub cMLE] of (C.1) is consistent for theta for each fixed x. Because the convergence of theta[sub cMLE] to theta is uniform in x for x in a neighborhood of z[sub q](theta), we conclude that theta[sub cMLE] is (unconditionally) consistent for theta. Now theta[sub cMLE] satisfies the conditional score equation
PSI(theta[sub cMLE], X[sub (nqn)]) = 0,
where X[sub (j)] is the jth-order statistic and
(C.2) [Multiple line equation(s) cannot be represented in ASCII text]
This implies that, up to terms of O(1/n), PSI(theta, x) is approximated by
(C.3) [Multiple line equation(s) cannot be represented in ASCII text]
The approximation in (C.3) is obtained by adding X[sub (nqn)] - theta and replacing nq[sub n] - 1 by nq[sub n].
By Taylor expansion of the score equation, we obtain
(C.4) n[sup 1/2](theta[sub cMLE] - theta) = -n[sup 1/2] PSI(theta, X[sub (nqn)])/PSI-dot(theta-bar, X[sub (nqn)])
for theta-bar between theta[sub cMLE] and theta, where PSI-dot is the derivative of PSI(theta, x) with respect to theta. Because theta[sub cMLE] is consistent for theta, we have, from (C.3), by a further expansion around theta,
[Multiple line equation(s) cannot be represented in ASCII text]
and (Multiple lines cannot be converted in ASCII text) as required.
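The two-truncated-families construction above lends itself to a direct numerical check. The sketch below is ours, not the authors'; the conditioning value, sample sizes, and the use of SciPy are our assumptions. It generates data from X|Z[sub q] = x and maximizes the conditional log-likelihood (C.1) to obtain theta[sub cMLE].

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm, truncnorm

rng = np.random.default_rng(1)
n, q, theta_true = 200, 0.5, 1.0
m = int(n * q)              # index of the conditioning order statistic
x_cond = theta_true + 0.1   # hypothetical conditioning value Z_q = x

# m - 1 draws from N(theta, 1) truncated to y <= x, and n - m truncated to y > x
below = truncnorm.rvs(-np.inf, x_cond - theta_true, loc=theta_true,
                      size=m - 1, random_state=rng)
above = truncnorm.rvs(x_cond - theta_true, np.inf, loc=theta_true,
                      size=n - m, random_state=rng)

def neg_cond_loglik(theta):
    # -log f(y | Z_q = x; theta): truncated-normal log densities on both sides
    ll = norm.logpdf(below - theta).sum() - (m - 1) * norm.logcdf(x_cond - theta)
    ll += norm.logpdf(above - theta).sum() - (n - m) * norm.logsf(x_cond - theta)
    return -ll

theta_cmle = minimize_scalar(neg_cond_loglik, bounds=(-5.0, 5.0),
                             method="bounded").x
print(theta_cmle)  # close to theta_true, illustrating consistency
```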
Proof of (13)-(14) for pi[sub ppost](.|X) in Example B of Section 3
The posterior density for theta based on the conditional model f(x|T = x; theta), with T = Z[sub q] = X[sub (nqn)], is, by (C.1),
(C.5) [Multiple line equation(s) cannot be represented in ASCII text]
For each x, we can use the Bernstein-von Mises theorem for independent random variables to conclude that
(C.6) [Multiple line equation(s) cannot be represented in ASCII text]
where P[sub x] is the law f(x|T = x; theta), change(theta, x) = theta + PSI(theta, x)/i(theta, x), and i(theta, x) = var[sub theta]{PSI(theta, x)|T = x}. Because the convergence in (C.6) is uniform for x in a neighborhood of z[sub q](theta), we conclude that for every sequence x[sub n] arrow right z[sub q](theta),
[Multiple line equation(s) cannot be represented in ASCII text]
This is sufficient to conclude that
[Multiple line equation(s) cannot be represented in ASCII text]
By dominated convergence, this gives
[Multiple line equation(s) cannot be represented in ASCII text]
However, by (C.4), we can substitute theta[sub cMLE] for change(theta, X[sub (nqn)]) and i[sub c, theta theta](theta) for i(theta, X[sub (nqn)]), concluding the proof.
[Received January 1999. Revised November 1999.]
By James M. Robins, Aad van der Vaart, and Valerie Ventura

James M. Robins is Professor of Epidemiology and Biostatistics, Harvard School of Public Health, Boston, MA 02115 (E-mail: robins@epinet.harvard.edu). Aad van der Vaart is Professor of Statistics, Faculty of Mathematics and Computer Science, Vrije Universiteit, Amsterdam, 1081 HV The Netherlands (E-mail: aad@cs.vu.nl). Valerie Ventura is Visiting Assistant Professor, Carnegie Mellon University, Pittsburgh, PA 15213 (E-mail: vventura@stat.cmu.edu). This work was supported in part by National Institutes of Health grant R01A132475. The authors thank George Casella and Martin Tanner for organizing these contributions to JASA, and the associate editor and three referees for helpful criticisms.
Title: | Bayesian Analysis: A Look at Today and Thoughts of Tomorrow. |
Subject(s): | |
Source: | |
Author(s): | |
Abstract: | Offers a look at the Bayesian analysis in statistics. Description on the Bayesian activity; Approaches to Bayesian analysis; Discussion on the computational techniques. |
AN: | 3851552 |
ISSN: | 0162-1459 |
Database: | Business Source Premier |
Life was simple when I became a Bayesian in the 1970s; it was possible to track virtually all Bayesian activity. Preparing this paper on Bayesian statistics was humbling, as I realized that I have lately been aware of only about 10% of the ongoing activity in Bayesian analysis. One goal of this article is thus to provide an overview of, and access to, a significant portion of this current activity. Necessarily, the overview will be extremely brief; indeed, an entire area of Bayesian activity might only be mentioned in one sentence and with a single reference. Moreover, many areas of activity are ignored altogether, either due to ignorance on my part or because no single reference provides access to the literature.
A second goal is to highlight issues or controversies that may shape the way that Bayesian analysis develops. This material is somewhat self-indulgent and should not be taken too seriously; for instance, if I had been asked to write such an article 10 years ago, I would have missed the mark by not anticipating the extensive development of Markov chain Monte Carlo (MCMC) and its enormous impact on Bayesian statistics.
Section 2 provides a brief snapshot of the existing Bayesian activity and emphasizes its dramatic growth in the 1990s, both inside and outside statistics. I found myself simultaneously rejoicing and being disturbed at the level of Bayesian activity. As a Bayesian, I rejoiced to see the extensive utilization of the paradigm, especially among nonstatisticians. As a statistician, I worried that our profession may not be adapting fast enough to this dramatic change; we may be in danger of "losing" Bayesian analysis to other disciplines (as we have "lost" other areas of statistics). In this regard, it is astonishing that most statistics and biostatistics departments in the United States do not even regularly offer a single Bayesian statistics course.
Section 3 is organized by approaches to Bayesian analysis--in particular the objective, subjective, robust, frequentist-Bayes, and what I term quasi-Bayes approaches. This section contains most of my musings about the current and future state of Bayesian statistics. Section 4 briefly discusses the critical issues of computation and software.
2.1 Numbers and Organizations
The dramatically increasing level of Bayesian activity can be seen in part through the raw numbers. Harry Martz (personal communication) studied the SciSearch database at Los Alamos National Laboratories to determine the increase in frequency of articles involving Bayesian analysis over the last 25 years. From 1974 to 1994, the trend was linear, with roughly a doubling of articles every 10 years. In the last 5 years, however, there has been a very dramatic upswing in both the number and the rate of increase of Bayesian articles.
This same phenomenon is also visible by looking at the number of books written on Bayesian analysis. During the first 200 years of Bayesian analysis (1769-1969), there were perhaps 15 books written on Bayesian statistics. Over the next 20 years (1970-1989), a guess as to the number of Bayesian books produced is 30. Over the last 10 years (1990-1999), roughly 60 Bayesian books have been written, not counting the many dozens of Bayesian conference proceedings and collections of papers. Bayesian books in particular subject areas are listed in Sections 2.2 and 2.3. A selection of general Bayesian books is given in Appendix A.
Another aspect of Bayesian activity is the diversity of existing organizations that are significantly Bayesian in nature, including the following (those with an active website): International Society of Bayesian Analysis (http://www.bayesian.org), ASA Section on Bayesian Statistical Science (http://www.stat.duke.edu/sbss/sbss.html), Decision Analysis Society of INFORMS (http://www. informs.org/society/da), and ASA Section on Risk Analysis (http://www.isds.duke.edu/riskanalysis/ras.html).
In addition to the activities and meetings of these societies, the following are long-standing series of prominent Bayesian meetings that are not organized explicitly by societies: Valencia Meetings on Bayesian Statistics (http://www.uv.es/~bernardo/valenciam.html), Conferences on Maximum Entropy and Bayesian Methods (http://omega.albany.edu:8008/maxent.html), CMU Workshops on Bayesian Case Studies (http://lib.stat.cmu.edu/ bayesworkshop/), and RSS Conferences on Practical Bayesian Statistics. The average number of Bayesian meetings per year is now well over 10, with at least an equal number of meetings being held that have a strong Bayesian component.
2.2 Interdisciplinary Activities and Applications
Applications of Bayesian analysis in industry and government are rapidly increasing but hard to document, as they are often "in-house" developments. It is far easier to document the extensive Bayesian activity in other disciplines; indeed, in many fields of the sciences and engineering, there are now active groups of Bayesian researchers. Here we can do little more than list various fields that have seen a considerable amount of Bayesian activity, and present a few references to access the corresponding literature. Most of the listed references are books on Bayesian statistics in the given field, emphasizing that the activity in the field has reached the level wherein books are being written. Indeed, this was the criterion for listing an area, although fields in which there is a commensurate amount of activity, but no book, are also listed. (It would be hard to find an area of human investigation in which there does not exist some level of Bayesian work, so many fields of application are omitted.)
For archaeology, see Buck, Cavanagh, and Litton (1996); atmospheric sciences, see Berliner, Royle, Wikle, and Milliff (1999); economics and econometrics, see Cyert and DeGroot (1987), Poirier (1995), Perlman and Blaug (1997), Kim, Shephard and Chib (1998), and Geweke (1999); education, see Johnson (1997); epidemiology, see Greenland (1998); engineering, see Godsill and Rayner (1998); genetics, see Iversen, Parmigiani, and Berry (1998), Dawid (1999) and Liu, Neuwald, and Lawrence (1999); hydrology, see Parent, Hubert, Bobee and Miquel (1998); law, see DeGroot, Fienberg, and Kadane (1986) and Kadane and Schum (1996); measurement and assay, see Brown (1993) and http://www.pnl.gov/bayesian/; medicine, see Berry and Stangl (1996) and Stangl and Berry (1998); physical sciences, see Bretthorst (1988), Jaynes (1999), and http://www.astro.cornell.edu/staff/loredo/bayes/; quality management, see Moreno and Rios-Insua (1999); social sciences, see Pollard (1986) and Johnson and Albert (1999).
2.3 Areas of Bayesian Statistics
Here, Bayesian activity is listed by statistical area. Again, the criterion for inclusion of an area is primarily the amount of Bayesian work being done in that area, as evidenced by books being written (or a corresponding level of papers).
For biostatistics, see Berry and Stangl (1996), Carlin and Louis (1996), and Kadane (1996); causality, see Spirtes, Glymour, and Scheines (1993) and Glymour and Cooper (1999); classification, discrimination, neural nets, and so on, see Neal (1996, 1999), Muller and Rios-Insua (1998), and the vignette by George; contingency tables, see the vignette by Fienberg; decision analysis and decision theory, see Smith (1988), Robert (1994), Clemen (1996), and the vignette by Brown; design, see Pilz (1991), Chaloner and Verdinelli (1995), and Muller (1999); empirical Bayes, see Carlin and Louis (1996) and the vignette by Carlin and Louis; exchangeability and other foundations, see Good (1983), Regazzini (1999), Kadane, Schervish and Seidenfeld (1999), and the vignette by Robins and Wasserman; finite-population sampling, see Bolfarine and Zacks (1992) and Mukhopadhyay (1998); generalized linear models, see Dey, Ghosh, and Mallick (2000); graphical models and Bayesian networks, see Pearl (1988), Jensen (1986), Lauritzen (1996), Jordan (1998), and Cowell, Dawid, Lauritzen, and Spiegelhalter (1999); hierarchical (multilevel) modeling, see the vignette by Hobert; image processing, see Fitzgerald, Godsill, Kokaram, and Stark (1999); information, see Barron, Rissanen, and Yu (1998) and the vignette by Soofi; missing data, see Rubin (1987) and the vignette by Meng; nonparametrics and function estimation, see Dey, Muller, and Sinha (1998), Muller and Vidakovic (1999), and the vignette by Robins and Wasserman; ordinal data, see Johnson and Albert (1999); predictive inference and model averaging, see Aitchison and Dunsmore (1975), Leamer (1978), Geisser (1993), Draper (1995), Clyde (1999), and the BMA website under "software"; reliability and survival analysis, see Clarotti, Barlow, and Spizzichino (1993) and Sinha and Dey (1999); sequential analysis, see Carlin, Kadane, and Gelfand (1998) and Qian and Brown (1999); signal processing, see O Ruanaidh and Fitzgerald (1996) and Fitzgerald, 
Godsill, Kokaram, and Stark (1999); spatial statistics, see Wolpert and Ickstadt (1998) and Besag and Higdon (1999); testing, model selection, and variable selection, see Kass and Raftery (1995), O'Hagan (1995), Berger and Pericchi (1996), Berger (1998), Racugno (1998), Sellke, Bayarri, and Berger (1999), Thiesson, Meek, Chickering, and Heckerman (1999), and the vignette by George; time series, see Pole, West, and Harrison (1995), Kitagawa and Gersch (1996) and West and Harrison (1997).
This section presents a rather personal view of the status and future of five approaches to Bayesian analysis, termed the objective, subjective, robust, frequentist-Bayes, and quasi-Bayes approaches. This is neither a complete list of the approaches to Bayesian analysis nor a broad discussion of the considered approaches. The section's main purpose is to emphasize the variety of different and viable Bayesian approaches to statistics, each of which can be of great value in certain situations and for certain users. We should be aware of the strengths and weaknesses of each approach, as all will be with us in the future and should be respected as part of the Bayesian paradigm.
3.1 Objective Bayesian Analysis
It is a common perception that Bayesian analysis is primarily a subjective theory. This is true neither historically nor in practice. The first Bayesians, Thomas Bayes (see Bayes 1783) and Laplace (see Laplace 1812), performed Bayesian analysis using a constant prior distribution for unknown parameters. Indeed, this approach to statistics, then called "inverse probability" (see Dale 1991) was very prominent for most of the nineteenth century and was highly influential in the early part of this century. Criticisms of the use of a constant prior distribution caused Jeffreys to introduce significant refinements of this theory (see Jeffreys 1961). Most of the applied Bayesian analyses I see today follow the Laplace-Jeffreys objective school of Bayesian analysis, possibly with additional modern refinements. (Of course, others may see subjective Bayesian applications more often, depending on the area in which they work.)
Many Bayesians object to the label "objective Bayes," claiming that it is misleading to say that any statistical analysis can be truly objective. Though agreeing with this at a philosophical level (Berger and Berry 1988), I feel that there are a host of practical and sociological reasons to use the label; statisticians must get over their aversion to calling good things by attractive names.
The most familiar element of the objective Bayesian school is the use of noninformative or default prior distributions. The most famous of these is the Jeffreys prior (see Jeffreys 1961). Maximum entropy priors are another well-known type of noninformative prior (although they often also reflect certain informative features of the system being analyzed). The more recent statistical literature emphasizes what are called reference priors (Bernardo 1979; Yang and Berger 1997), which prove remarkably successful from both Bayesian and non-Bayesian perspectives. Kass and Wasserman (1996) provided a recent review of methods for selecting noninformative priors.
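As a concrete illustration (my own worked example, not from the article): for a binomial proportion p, the Fisher information is 1/(p(1-p)), so the Jeffreys prior is proportional to p^{-1/2}(1-p)^{-1/2}, i.e., a Beta(1/2, 1/2) distribution, and conjugacy makes the posterior after s successes in n trials a Beta(s + 1/2, n - s + 1/2). A minimal sketch:

```python
from scipy import stats

# Jeffreys prior for a binomial proportion p is Beta(1/2, 1/2).
# With s successes in n trials, conjugacy gives the posterior
# Beta(s + 1/2, n - s + 1/2).
def jeffreys_posterior(s, n):
    return stats.beta(s + 0.5, n - s + 0.5)

post = jeffreys_posterior(s=7, n=10)
print(post.mean())          # posterior mean (s + 1/2) / (n + 1)
print(post.interval(0.95))  # equal-tailed 95% credible interval
```

The posterior mean (s + 1/2)/(n + 1) shrinks the raw proportion s/n slightly toward 1/2, one small way in which this default prior behaves sensibly from both Bayesian and frequentist perspectives.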
A quite different area of the objective Bayesian school is that concerned with techniques for default model selection and hypothesis testing. Successful developments in this direction are much more recent (Berger and Pericchi 1996; Kass and Raftery 1995; O'Hagan 1995; Sellke, Bayarri, and Berger 1999). Indeed, there is still considerable ongoing discussion as to which default methods are to be preferred for these problems (see Racugno 1998).
The main concern with objective Bayesian procedures is that they often utilize improper prior distributions, and so do not automatically have desirable Bayesian properties, such as coherency. Also, a poor choice of improper priors can even lead to improper posteriors. Thus proposed objective Bayesian procedures are typically studied to ensure that such problems do not arise. Also, objective Bayesian procedures are often evaluated from non-Bayesian perspectives, and usually turn out to be stunningly effective from these perspectives.
3.2 Subjective Bayesian Analysis
Although comparatively new on the Bayesian scene, subjective Bayesian analysis is currently viewed by many Bayesians to be the "soul" of Bayesian statistics. Its philosophical appeal is undeniable, and few statisticians would argue against its use when the needed inputs (models and subjective prior distributions) can be fully and accurately specified. The difficulty in such specification (Kahneman, Slovic, and Tversky 1986) often limits application of the approach, but there has been a considerable research effort to further develop elicitation techniques for subjective Bayesian analysis (Lad 1996; French and Smith 1997; The Statistician, 47, 1998).
In many problems, use of subjective prior information is clearly essential, and in others it is readily available; use of subjective Bayesian analysis for such problems can provide dramatic gains. Even when a complete subjective analysis is not feasible, judicious use of partly subjective and partly objective prior distributions is often attractive (Andrews, Berger, and Smith 1993).
3.3 Robust Bayesian Analysis

Robust Bayesian analysis recognizes the impossibility of complete subjective specification of the model and prior distribution; after all, complete specification would involve an infinite number of assessments, even in the simplest situations. The idea is thus to work with classes of models and classes of prior distributions, with the classes reflecting the uncertainty remaining after the (finite) elicitation efforts. (Classes could also reflect the differing judgments of various individuals involved in the decision process.)
The foundational arguments for robust Bayesian analysis are compelling (Kadane 1984; Walley 1991), and there is an extensive literature on the development of robust Bayesian methodology, including Berger (1985, 1994), Berger et al. (1996), and Rios Insua (1990). Routine practical implementation of robust Bayesian analysis will require development of appropriate software, however.
Robust Bayesian analysis is also an attractive technology for actually implementing a general subjective Bayesian elicitation program. Resources (time and money) for subjective elicitation typically are very limited in practice, and need to be optimally utilized. Robust Bayesian analysis can, in principle, be used to direct the elicitation effort, by first assessing if the current information (elicitations and data) is sufficient for solving the problem and then, if not, determining which additional elicitations would be most valuable (Liseo, Petrella, and Salinetti 1996).
3.4 Frequentist Bayes Analysis
It is hard to imagine that the current situation, with several competing foundations for statistics, will exist indefinitely. Assuming that a unified foundation is inevitable, what will it be? Today, an increasing number of statisticians envisage that this unified foundation will be a mix of Bayesian and frequentist ideas (with elements of the current likelihood theory thrown in; see the vignette by Reid). Here is my view of what this mixture will be.
First, the language of statistics will be Bayesian. Statistics is about measuring uncertainty, and over 50 years of efforts to prove otherwise have convincingly demonstrated that the only coherent language in which to discuss uncertainty is the Bayesian language. In addition, the Bayesian language is an order of magnitude easier to understand than the classical language (witness the p value controversy; Sellke et al. 1999), so that a switch to the Bayesian language should considerably increase the attractiveness of statistics. Note that, as discussed earlier, this is not about subjectivity or objectivity; the Bayesian language can be used for either subjective or objective statistical analysis.
On the other hand, from a methodological perspective, it is becoming clear that both Bayesian and frequentist methodologies are going to be important. For parametric problems, Bayesian analysis seems to have a clear methodological edge, but frequentist concepts can be very useful, especially in determining good objective Bayesian procedures (see, e.g., the vignette by Reid).
In nonparametric analysis, it has long been known (Diaconis and Freedman 1986) that Bayesian procedures can behave poorly from a frequentist perspective. Although poor frequentist performance is not necessarily damning to a Bayesian, it typically should be viewed as a warning sign that something is amiss, especially when the prior distribution used contains more "hidden" information than elicited information (as is virtually always the case with nonparametric priors).
Furthermore, there are an increasing number of examples in which frequentist arguments yield satisfactory answers quite directly, whereas Bayesian analysis requires a formidable amount of extra work. (The simplest such example is MCMC itself, in which one evaluates an integral by a sample average, and not by a formal Bayesian estimate; see the vignette by Robins and Wasserman for other examples). In such cases, I believe that the frequentist answer can be accepted by Bayesians as an approximate Bayesian answer, although it is not clear in general how this can be formally verified.
This discussion of unification has been primarily from a Bayesian perspective. From a frequentist perspective, unification also seems inevitable. It has long been known that "optimal" unconditional frequentist procedures must be Bayesian (Berger 1985), and there is growing evidence that this must be so even from a conditional frequentist perspective (Berger, Boukai, and Wang 1997).
Note that I am not arguing for an eclectic attitude toward statistics here; indeed, I think the general refusal in our field to strive for a unified perspective has been the single biggest impediment to its advancement. I am simply saying that any unification that will be achieved will almost necessarily have frequentist components to it.
3.5 Quasi-Bayes Analysis

There is another type of Bayesian analysis that one increasingly sees being performed, and that can be unsettling to "pure" Bayesians and many non-Bayesians. In this type of analysis, priors are chosen in various ad hoc fashions, including choosing vague proper priors, choosing priors to "span" the range of the likelihood, and choosing priors with tuning parameters that are adjusted until the answer "looks nice." I call such analyses quasi-Bayes because, although they utilize Bayesian machinery, they do not carry the guarantees of good performance that come with either true subjective analysis or (well-studied) objective Bayesian analysis. It is useful to briefly discuss the possible problems with each of these quasi-Bayes procedures.
Using vague proper priors will work well when the vague proper prior is a good approximation to a good objective prior, but this can fail to be the case. For instance, in normal hierarchical models with a "higher-level" variance V, it is quite common to use the vague proper prior density π(V) ∝ V^{-(ε+1)} exp(-ε′/V), with ε and ε′ small. However, as ε → 0, it is typically the case in these models that the posterior distribution for V will pile up its mass near 0, so that the answer can be ridiculous if ε is too small. An objective Bayesian who incorrectly used the related prior π(V) ∝ V^{-1} would typically become aware of the problem, because the posterior would not converge (as it will with the vague proper prior). The common perception that using a vague proper prior is safer than using improper priors, or conveys some type of guarantee of good performance, is simply wrong.
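The pile-up is easy to see numerically. The sketch below is my own toy illustration (not from the article): a normal means model y_i ~ N(θ_i, 1) with θ_i ~ N(0, V), so that marginally y_i ~ N(0, 1 + V), with the vague proper prior above on V. It compares the posterior mass that V places below 0.01 for moderate versus tiny ε = ε′:

```python
import numpy as np

# Toy normal means model: y_i ~ N(theta_i, 1), theta_i ~ N(0, V),
# so marginally y_i ~ N(0, 1 + V). Prior density on V proportional to
# V^{-(eps+1)} * exp(-eps_p / V). Illustrative data, not from the article:
y = np.array([0.3, -1.2, 0.8, -0.5, 1.5, -0.9, 0.2, 1.1])

def posterior_mass_below(v0, eps, eps_p):
    """Approximate P(V < v0 | y) by trapezoid quadrature on a log grid."""
    V = np.logspace(-12, 3, 20000)
    log_prior = -(eps + 1.0) * np.log(V) - eps_p / V
    log_lik = -0.5 * len(y) * np.log(1.0 + V) - np.sum(y**2) / (2.0 * (1.0 + V))
    log_post = log_prior + log_lik
    w = np.exp(log_post - log_post.max())     # unnormalized posterior density
    mids = 0.5 * (w[1:] + w[:-1]) * np.diff(V)
    return np.sum(mids[V[1:] < v0]) / np.sum(mids)

# Moderate epsilon behaves reasonably; a "very vague" proper prior
# piles most of the posterior mass of V near 0.
print(posterior_mass_below(0.01, 1e-2, 1e-2))
print(posterior_mass_below(0.01, 1e-10, 1e-10))
```

The second probability is far larger than the first, even though the data were generated with V near 1: making the proper prior "vaguer" degrades, rather than safeguards, the answer.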
The second common quasi-Bayes procedure is to choose priors that span the range of the likelihood function. For instance, one might choose a uniform prior over a range that includes most of the "mass" of the likelihood function but that does not extend too far (thus hopefully avoiding the problem of using a "too vague" proper prior). Another version of this procedure is to use conjugate priors, with parameters chosen so that the prior is considerably more spread out than the likelihood function but is roughly centered in the same region. The two obvious concerns with these strategies are that (a) the answer can still be quite sensitive to the spread of the rather arbitrarily chosen prior, and (b) centering the prior on the likelihood is a problematical double use of the data. Also, in problems with complicated likelihoods, it can be difficult to implement this strategy successfully.
The third common quasi-Bayes procedure is to write down proper (often conjugate) priors with unspecified parameters, and then treat these parameters as "tuning" parameters to be adjusted until the answer "looks nice." Unfortunately, one is sometimes not told that this has been done; that is, the choice of the parameters is, after the fact, presented as "natural."
These issues are complicated by the fact that in the hands of an expert Bayesian analyst, the quasi-Bayes procedures mentioned here can be quite reasonable, in that the expert may have the experience and skill to tell when the procedures are likely to be successful. Also, one must always consider the question: What is the alternative? I have seen many examples in which an answer was required and in which I would trust the quasi-Bayes answer more than the answer from any feasible alternative analysis.
Finally, it is important to recognize that the genie cannot be put back into the bottle. The Bayesian "machine," together with MCMC, is arguably the most powerful mechanism ever created for processing data and knowledge. The quasi-Bayes approach can rather easily create procedures of astonishing flexibility for data analysis, and its use to create such procedures should not be discouraged. However, it must be recognized that these procedures do not necessarily have intrinsic Bayesian justifications, and so must be justified on extrinsic grounds (e.g., through extensive sensitivity studies, simulations, etc.).
4.1 Computational Techniques
Even 20 years ago, one often heard the refrain that "Bayesian analysis is nice conceptually; too bad it is not possible to compute Bayesian answers in realistic situations." Today, truly complex models often can only be computationally handled by Bayesian techniques. This has attracted many newcomers to the Bayesian approach and has had the interesting effect of considerably reducing discussion of "philosophical" arguments for and against the Bayesian position.
Although other goals are possible, most Bayesian computation is focused on calculation of posterior expectations, which are typically integrals of one to thousands of dimensions. Another common type of Bayesian computation is calculation of the posterior mode (as in computing MAP estimates in image processing).
The traditional numerical methods for computing posterior expectations are numerical integration, Laplace approximation, and Monte Carlo importance sampling. Numerical integration can be effective in moderate (say, up to 10) dimensional problems. Modern developments in this direction were discussed by Monahan and Genz (1996). Laplace and other saddlepoint approximations are discussed in the vignette by R. Strawderman. Until recently, Monte Carlo importance sampling was the most commonly used traditional method of computing posterior expectations. The method can work in very large dimensions and has the nice feature of producing reliable measures of the accuracy of the computation.
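The last point can be made concrete with a minimal self-normalized importance sampling sketch (my own toy example, not from the article). The "posterior" here is chosen to be N(2, 1) so the true expectation is known, and the proposal is a heavier-tailed Student-t, a standard safeguard for keeping the weights well behaved:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Toy "posterior": N(2, 1), so E[theta] = 2 exactly.
# Proposal: Student-t with 3 df, heavier-tailed than the target.
target = stats.norm(loc=2.0, scale=1.0)
proposal = stats.t(df=3, loc=2.0, scale=1.5)

n = 200_000
x = proposal.rvs(size=n, random_state=rng)
w = np.exp(target.logpdf(x) - proposal.logpdf(x))  # importance weights
w_norm = w / w.sum()                               # self-normalized weights

est = np.sum(w_norm * x)                                 # estimate of E[theta]
se = np.sqrt(np.sum(w_norm**2 * (x - est) ** 2))         # its standard error
ess = 1.0 / np.sum(w_norm**2)                            # effective sample size
print(f"estimate = {est:.4f}, se = {se:.4f}, ESS = {ess:.0f}")
```

The standard error and effective sample size computed from the weights are exactly the "reliable measures of the accuracy of the computation" that make importance sampling attractive.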
Today, MCMC has become the most popular method of Bayesian computation, in part because of its power in handling very complex situations and in part because it is comparatively easy to program. Because the Gibbs sampling vignette by Gelfand and the MCMC vignette by Cappe and Robert both address this computational technique, I do not discuss it here. Recent books in the area include those of Chen, Shao, and Ibrahim (2000), Gamerman (1997), Robert and Casella (1999), and Tanner (1993). It is not strictly the case that MCMC is replacing the more traditional methods listed above. For instance, in some problems importance sampling will probably always remain the computational method of choice, as will standard numerical integration in low dimensions (especially when extreme accuracy is needed).
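Although the MCMC vignettes cover the technique itself, a minimal random-walk Metropolis sketch (again my own toy example) shows why it is "comparatively easy to program." With a flat prior and y_i ~ N(μ, 1), the posterior for μ is N(ȳ, 1/n), so the sampler's output can be checked against the known answer:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y_i ~ N(mu, 1). With a flat prior on mu, the posterior is
# N(ybar, 1/n), giving an exact answer to check the sampler against.
y = rng.normal(1.5, 1.0, size=20)

def log_post(mu):
    return -0.5 * np.sum((y - mu) ** 2)   # log posterior up to a constant

def metropolis(n_iter=20_000, step=0.5, mu0=0.0):
    chain = np.empty(n_iter)
    mu, lp = mu0, log_post(mu0)
    for t in range(n_iter):
        prop = mu + step * rng.normal()          # symmetric random-walk proposal
        lp_prop = log_post(prop)
        if np.log(rng.random()) < lp_prop - lp:  # Metropolis accept/reject
            mu, lp = prop, lp_prop
        chain[t] = mu
    return chain

chain = metropolis()[2_000:]    # discard burn-in
print(chain.mean(), y.mean())   # sample mean of the chain vs. exact posterior mean
```

A dozen lines suffice because only the unnormalized log posterior is needed; no normalizing constant, derivative, or problem-specific integral ever enters the algorithm.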
4.2 Bayesian Software

Availability of general user-friendly Bayesian software is clearly needed to advance the use of Bayesian methods. A number of software packages exist, and these are very useful for particular scenarios. Lists and descriptions of pre-1990 Bayesian software were provided by Goel (1988) and Press (1989). A list of some of the Bayesian software developed since 1990 is given in Appendix B.
It would, of course, be wonderful to have a single general-purpose Bayesian software package, but three of the major strengths of the modern Bayesian approach create difficulties in developing generic software. One difficulty is the extreme flexibility of Bayesian analysis, with virtually any constructed model being amenable to analysis. Most classical packages need to contend with only a relatively few well-defined models or scenarios for which a classical procedure has been determined. Another strength of Bayesian analysis is the possibility of extensive utilization of subjective prior information, and many Bayesians tend to feel that software should include an elaborate expert system for prior elicitation. Finally, implementing the modern computational techniques in a software package is extremely challenging, because it is difficult to codify the "art" of finding a successful computational strategy in a complex situation.
Note that development of software implementing the objective Bayesian approach for "standard" statistical models can avoid these difficulties. There would be no need for a subjective elicitation interface, and the package could incorporate specific computational techniques suited to the various standard models being considered. Because the vast majority of statistical analyses done today use such "automatic" software, having a Bayesian version would greatly impact the actual use of Bayesian methodology. Its creation should thus be a high priority for the profession.
References

Aitchison, J., and Dunsmore, I. R. (1975), Statistical Prediction Analysis, New York: Wiley.

Albert, J. H. (1996), Bayesian Computation Using Minitab, Belmont, CA: Wadsworth.

Andrews, R., Berger, J., and Smith, M. (1993), "Bayesian Estimation of Fuel Economy Potential due to Technology Improvements," in Case Studies in Bayesian Statistics, eds. C. Gatsonis et al., New York: Springer-Verlag, pp. 1-77.

Antelman, G. (1997), Elementary Bayesian Statistics, Hong Kong: Edward Elgar.
Barron, A., Rissanen, J., and Yu, B. (1998), "The Minimum Description Length Principle in Coding and Modeling," IEEE Transactions on Information Theory, 44, 2743-2760.
Bayes, T. (1763), "An Essay Towards Solving a Problem in the Doctrine of Chances," Philosophical Transactions of the Royal Society, 53, 370-418.
Bayarri, M. J., and Berger, J. (2000), "P-Values for Composite Null Models," Journal of the American Statistical Association, 95, 1127-1142.
Berger, J. (1985), Statistical Decision Theory and Bayesian Analysis (2nd ed.), New York: Springer-Verlag.
----- (1994), "An Overview of Robust Bayesian Analysis," Test, 3, 5-124.
----- (1998), "Bayes Factors," in Encyclopedia of Statistical Sciences, Volume 3 (update), eds. S. Kotz et al., New York: Wiley.
Berger, J., and Berry, D. (1988), "Analyzing Data: Is Objectivity Possible?," The American Scientist, 76, 159-165.
Berger, J., Betro, B., Moreno, E., Pericchi, L. R., Ruggeri, F., Salinetti, G., and Wasserman, L. (Eds.) (1996), Bayesian Robustness, Hayward, CA: Institute of Mathematical Statistics.
Berger, J., Boukai, B., and Wang, W. (1997), "Unified Frequentist and Bayesian Testing of Precise Hypotheses," Statistical Science, 12, 133-160.
Berger, J., and Pericchi, L. R. (1996), "The Intrinsic Bayes Factor for Model Selection and Prediction," Journal of the American Statistical Association, 91, 109-122.
Berliner, L. M., Royle, J. A., Wikle, C. K., and Milliff, R. F. (1999), "Bayesian Methods in the Atmospheric Sciences," in Bayesian Statistics 6, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, London: Oxford University Press, pp. 83-100.
Bernardo, J. M. (1979), "Reference Posterior Distributions for Bayesian Inference" (with discussion), Journal of the Royal Statistical Society, Ser. B, 41, 113-147.
Bernardo, J. M., Berger, J. O., Dawid, A. P., and Smith, A. F. M. (Eds.) (1999), Bayesian Statistics 6, London: Oxford University Press.
Bernardo, J. M., and Smith, A. F. M. (1994), Bayesian Theory, New York: Wiley.
Berry, D. A. (1996), Statistics: A Bayesian Perspective, Belmont, CA: Wadsworth.
Berry, D. A., and Stangl, D. K. (Eds.) (1996), Bayesian Biostatistics, New York: Marcel Dekker.
----- (2000), Meta-Analysis in Medicine and Health Policy, New York: Marcel Dekker.
Besag, J., and Higdon, D. (1999), "Bayesian Inference for Agricultural Field Experiments" (with Discussion), Journal of the Royal Statistical Society, Ser. B, 61, 691-746.
Bolfarine, H., and Zacks, S. (1992), Prediction Theory for Finite Populations, New York: Springer-Verlag.
Box, G., and Tiao, G. (1973), Bayesian Inference in Statistical Analysis, Reading, MA: Addison-Wesley.
Bretthorst, G. L. (1988), Bayesian Spectrum Analysis and Parameter Estimation, New York: Springer-Verlag.
Broemeling, L. D. (1985), Bayesian Analysis of Linear Models, New York: Marcel Dekker.
Brown, P. J. (1993), Measurement, Regression, and Calibration, Oxford, U.K.: Clarendon Press.
Buck, C. E., Cavanagh, W. G., and Litton, C. D. (1996), The Bayesian Approach to Interpreting Archaeological Data, New York: Wiley.
Carlin, B. P., and Louis, T. A. (1996), Bayes and Empirical Bayes Methods for Data Analysis, London: Chapman and Hall.
Carlin, B., Kadane, J., and Gelfand, A. (1998), "Approaches for Optimal Sequential Decision Analysis in Clinical Trials," Biometrics, 54, 964-975.
Chaloner, K., and Verdinelli, I. (1995), "Bayesian Experimental Design: A Review," Statistical Science, 10, 273-304.
Chen, M. H., Shao, Q. M., and Ibrahim, J. G. (2000), Monte Carlo Methods in Bayesian Computation, New York: Springer-Verlag.
Clarotti, C. A., Barlow, R. E., and Spizzichino, F. (Eds.) (1993), Reliability and Decision Making, Amsterdam: Elsevier Science.
Clemen, R. T. (1996), Making Hard Decisions: An Introduction to Decision Analysis (2nd ed.), Belmont, CA: Duxbury.
Clyde, M. A. (1999), "Bayesian Model Averaging and Model Search Strategies," in Bayesian Statistics 6, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, London: Oxford University Press, pp. 23-42.
Cowell, R. G., Dawid, A. P., Lauritzen, S. L., and Spiegelhalter, D. J. (1999), Probabilistic Networks and Expert Systems, New York: Springer.
Cyert, R. M., and DeGroot, M. H. (1987), Bayesian Analysis in Economic Theory, Totowa, NJ: Rowman & Littlefield.
Dale, A. I. (1991), A History of Inverse Probability, New York: Springer-Verlag.
Dawid, A. P., and Pueschel, J. (1999), "Hierarchical Models for DNA Profiling Using Heterogeneous Databases," in Bayesian Statistics 6, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, London: Oxford University Press, pp. 187-212.
de Finetti, B. (1974, 1975), Theory of Probability, Vols. 1 and 2, New York: Wiley.
DeGroot, M. H. (1970), Optimal Statistical Decisions, New York: McGraw-Hill.
DeGroot, M. H., Fienberg, S. E., and Kadane, J. B. (1986), Statistics and the Law, New York: Wiley.
Dey, D., Ghosh, S., and Mallick, B. K. (Eds.) (2000), Bayesian Generalized Linear Models, New York: Marcel Dekker.
Dey, D., Muller, P., and Sinha, D. (Eds.) (1998), Practical Nonparametric and Semiparametric Bayesian Statistics, New York: Springer-Verlag.
Diaconis, P., and Freedman, D. (1986), "On the Consistency of Bayes Estimates," The Annals of Statistics, 14, 1-67.
Draper, D. (1995), "Assessment and Propagation of Model Uncertainty," Journal of the Royal Statistical Society, Ser. B, 57, 45-98.
Erickson, G., Rychert, J., and Smith, C. R. (1998), Maximum Entropy and Bayesian Methods, Norwell, MA: Kluwer Academic.
Fitzgerald, W. J., Godsill, S. J., Kokaram, A. C., and Stark, J. A. (1999), "Bayesian Methods in Signal and Image Processing," in Bayesian Statistics 6, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, London: Oxford University Press, pp. 239-254.
Florens, J. P., Mouchart, M., and Roulin, J. M. (1990), Elements of Bayesian Statistics, New York: Marcel Dekker.
French, S., and Smith, J. Q. (Eds.) (1997), The Practice of Bayesian Analysis, London: Arnold.
Gamerman, D. (1997), Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference, London: Chapman and Hall.
Gatsonis, C., Kass, R., Carlin, B. P., Carriquiry, A. L., Gelman, A., Verdinelli, I., and West, M. (Eds.) (1998), Case Studies in Bayesian Statistics IV, New York: Springer-Verlag.
Geisser, S. (1993), Predictive Inference: An Introduction, London: Chapman and Hall.
Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (1995), Bayesian Data Analysis, London: Chapman and Hall.
Geweke, J. (1999), "Using Simulation Methods for Bayesian Econometric Models: Inference, Development and Communication" (with discussion), Econometric Reviews, 18, 1-73.
Glymour, C., and Cooper, G. (Eds.) (1999), Computation, Causation, and Discovery, Cambridge, MA: MIT Press.
Godsill, S. J., and Rayner, P. J. W. (1998), Digital Audio Restoration, Berlin: Springer.
Goel, P. (1988), "Software for Bayesian Analysis: Current Status and Additional Needs," in Bayesian Statistics 3, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, London: Oxford University Press.
Goldstein, M. (1998), "Bayes Linear Analysis," in Encyclopedia of Statistical Sciences, Update Vol. 3, New York: Wiley.
Good, I. J. (1983), Good Thinking: The Foundations of Probability and Its Applications, Minneapolis: University of Minnesota Press.
Greenland, S. (1998), "Probability Logic and Probability Induction," Epidemiology, 9, 322-332.
Hartigan, J. A. (1983), Bayes Theory, New York: Springer-Verlag.
Howson, C., and Urbach, P. (1990), Scientific Reasoning: The Bayesian Approach, La Salle, IL: Open Court.
Iversen, E. Jr., Parmigiani, G., and Berry, D. (1998), "Validating Bayesian Prediction Models: A Case Study in Genetic Susceptibility to Breast Cancer," in Case Studies in Bayesian Statistics IV, eds. C. Gatsonis, R. E. Kass, B. Carlin, A. Carriquiry, A. Gelman, I. Verdinelli, and M. West, New York: Springer-Verlag.
Jaynes, E. T. (1999), "Probability Theory: The Logic of Science," accessible at http://bayes.wustl.edu/etj/prob.html.
Jeffreys, H. (1961), Theory of Probability (3rd ed.), London: Oxford University Press.
Jensen, F. V. (1996), An Introduction to Bayesian Networks, London: University College of London Press.
Johnson, V. E. (1997), "An Alternative to Traditional GPA for Evaluating Student Performance," Statistical Science, 12, 251-278.
Johnson, V. E., and Albert, J. (1999), Ordinal Data Models, New York: Springer-Verlag.
Jordan, M. I. (Ed.) (1998), Learning in Graphical Models, Cambridge, MA: MIT Press.
Kadane, J. (Ed.) (1984), Robustness of Bayesian Analysis, Amsterdam: North-Holland.
----- (Ed.) (1996), Bayesian Methods and Ethics in a Clinical Trial Design, New York: Wiley.
Kadane, J., Schervish, M., and Seidenfeld, T. (Eds.) (1999), Rethinking the Foundations of Statistics, Cambridge, U.K.: Cambridge University Press.
Kadane, J., and Schuan, D. A. (1996), A Probabilistic Analysis of the Sacco and Vanzetti Evidence, New York: Wiley.
Kahneman, D., Slovic, P., and Tversky, A. (1986), Judgment Under Uncertainty: Heuristics and Biases, Cambridge, U.K.: Cambridge University Press.
Kass, R., and Raftery, A. (1995), "Bayes Factors," Journal of the American Statistical Association, 90, 773-795.
Kass, R., and Wasserman, L. (1996), "The Selection of Prior Distributions by Formal Rules," Journal of the American Statistical Association, 91, 1343-1370.
Kim, S., Shephard, N., and Chib, S. (1998), "Stochastic Volatility: Likelihood Inference and Comparison With ARCH Models," Review of Economic Studies, 65, 361-393.
Kitagawa, G., and Gersch, W. (1996), Smoothness Priors Analysis of Time Series, New York: Springer.
Lad, F. (1996), Operational Subjective Statistical Methods, New York: Wiley.
Lauritzen, S. L. (1996), Graphical Models, London: Oxford University Press.
Laplace, P. S. (1812), Theorie Analytique des Probabilites, Paris: Courcier.
Leamer, E. E. (1978), Specification Searches: Ad Hoc Inference With Nonexperimental Data, Chichester, U.K.: Wiley.
Lee, P. M. (1997), Bayesian Statistics: An Introduction, London: Edward Arnold.
Lindley, D. V. (1972), Bayesian Statistics, A Review, Philadelphia: SIAM.
Liseo, B., Petrella, L., and Salinetti, G. (1996), "Robust Bayesian Analysis: an Interactive Approach," in Bayesian Statistics 5, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, London: Oxford University Press, pp. 661-666.
Liu, J., Neuwald, A., and Lawrence, C. (1999), "Markovian Structures in Biological Sequence Alignments," Journal of the American Statistical Association, 94, 1-15.
Lynn, N., Singpurwalla, N., and Smith, A. (1998), "Bayesian Assessment of Network Reliability," SIAM Review, 40, 202-227.
Monahan, J., and Genz, A. (1996), "A Comparison of Omnibus Methods for Bayesian Computation," Computing Science and Statistics, 27, 471-480.
Muller, P. M. (1999), "Simulation-Based Optimal Design," in Bayesian Statistics 6, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, London: Oxford University Press, pp. 459-474.
Muller, P. M., and Rios-Insua, D. (1998), "Issues in Bayesian Analysis of Neural Network Models," Neural Computation, 10, 571-592.
Muller, P. M., and Vidakovic, B. (Eds.) (1999), Bayesian Inference in Wavelet-Based Models, New York: Springer-Verlag.
Mukhopadhyay, P. (1998), Small Area Estimation in Survey Sampling, New Delhi: Narosa.
Neal, R. M. (1996), Bayesian Learning for Neural Networks, New York: Springer.
----- (1999), "Regression and Classification Using Gaussian Process Priors," in Bayesian Statistics 6, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, London: Oxford University Press, pp. 475-501.
O'Hagan, A. (1988), Probability: Methods and Measurements, London: Chapman and Hall.
----- (1994), Kendall's Advanced Theory of Statistics, Vol. 2B--Bayesian Inference, London: Arnold.
----- (1995), "Fractional Bayes Factors for Model Comparisons," Journal of the Royal Statistical Society, Ser. B, 57, 99-138.
O Ruanaidh, J. J. K., and Fitzgerald, W. J. (1996), Numerical Bayesian Methods Applied to Signal Processing, New York: Springer.
Parent, E., Hubert, P., Bobee, B., and Miquel, J. (Eds.) (1998), Statistical and Bayesian Methods in Hydrological Sciences, Paris: UNESCO Press.
Pearl, J. (1988), Probabilistic Inference in Intelligent Systems, San Mateo, CA: Morgan Kaufmann.
Perlman, M., and Blaug, M. (Eds.) (1997), Bayesian Analysis in Econometrics and Statistics: The Zellner View, Northhampton, MA: Edward Elgar.
Piccinato, L. (1996), Metodi per le Decisioni Statistiche, Milano: Springer-Verlag Italia.
Pilz, J. (1991), Bayesian Estimation and Experimental Design in Linear Regression (2nd ed.), New York: Wiley.
Poirier, D. J. (1995), Intermediate Statistics and Econometrics: A Comparative Approach, Cambridge, MA: MIT Press.
Pole, A., West, M., and Harrison, J. (1995), Applied Bayesian Forecasting Methods, London: Chapman and Hall.
Pollard, W. E. (1986), Bayesian Statistics for Evaluation Research, Beverly Hills, CA: Sage.
Press, J. (1989), Bayesian Statistics, New York: Wiley.
Qian, W., and Brown, P. J. (1999), "Bayes Sequential Decision Theory in Clinical Trials," in Bayesian Statistics 6, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, London: Oxford University Press, pp. 829-838.
Racugno, W. (Ed.) (1998), Proceedings of the Workshop on Model Selection, special issue of Rassegna di Metodi Statistici ed Applicazioni, Bologna: Pitagora Editrice.
Regazzini, E. (1999), "Old and Recent Results on the Relationship Between Predictive Inference and Statistical Modelling Either in Nonparametric or Parametric Form," in Bayesian Statistics 6, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, London: Oxford University Press, pp. 571-588.
Rios Insua, D. (1990), Sensitivity Analysis in Multiobjective Decision Making, New York: Springer-Verlag.
Robert, C. P. (1994), The Bayesian Choice: A Decision-Theoretic Motivation, New York: Springer.
Robert, C. P., and Casella, G. (1999), Monte Carlo Statistical Methods, New York: Springer-Verlag.
Rubin, D. B. (1987), Multiple Imputation for Nonresponse in Surveys, New York: Wiley.
Savage, L. J. (1972), The Foundations of Statistics (2nd ed.), New York: Dover.
Sellke, T., Bayarri, M. J., and Berger, J. O. (1999), "Calibration of P Values for Precise Null Hypotheses," ISDS Discussion Paper 99-13, Duke University.
Schervish, M. (1995), Theory of Statistics, New York: Springer.
Sinha, D., and Dey, D. (1999), "Survival Analysis Using Semiparametric Bayesian Methods," in Practical Nonparametric and Semiparametric Statistics, Eds. D. Dey, P. Muller, and D. Sinha, New York: Springer, pp. 195-211.
Sivia, D. S. (1996), Data Analysis: A Bayesian Tutorial, London: Oxford University Press.
Smith, J. Q. (1988), Decision Analysis: A Bayesian Approach, London: Chapman and Hall.
Spirtes, P., Glymour, C., and Scheines, R. (1993), Causation, Prediction, and Search, New York: Springer-Verlag.
Stangl, D., and Berry, D. (1998), "Bayesian Statistics in Medicine: Where We are and Where We Should be Going," Sankhya, Ser. B, 60, 176-195.
Tanner, M. A. (1993), Tools for Statistical Inference: Methods for the Exploration of Posterior Distributions and Likelihood Functions (2nd ed.), New York: Springer-Verlag.
Thiesson, B., Meek, C., Chickering, D. M., and Heckerman, D. (1999), "Computationally Efficient Methods for Selecting Among Mixtures of Graphical Models," in Bayesian Statistics 6, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, London: Oxford University Press, pp. 631-656.
Tierney, L. (1991), Lisp-Stat, An Object-Oriented Environment for Statistical Computing and Dynamic Graphics, New York: Wiley.
Walley, P. (1991), Statistical Reasoning With Imprecise Probabilities, London: Chapman and Hall.
West, M., and Harrison, J. (1997), Bayesian Forecasting and Dynamic Models (2nd ed.), New York: Springer-Verlag.
Winkler, R. L. (1972), Introduction to Bayesian Inference and Decision, New York: Holt, Rinehart, and Winston.
Wolpert, R. L., and Ickstadt, K. (1998), "Poisson/Gamma Random Field Models for Spatial Statistics," Biometrika, 85, 251-267.
Yang, R., and Berger, J. (1997), "A Catalogue of Noninformative Priors," ISDS Discussion Paper 97-42, Duke University.
Zellner, A. (1971), An Introduction to Bayesian Inference in Econometrics, New York: Wiley.
• Historical and general monographs: Laplace (1812), Jeffreys (1961), Zellner (1971), Savage (1972), Lindley (1972), Box and Tiao (1973), de Finetti (1974, 1975), Hartigan (1983), Florens, Mouchart, and Roulin (1990)
• Graduate-level texts: DeGroot (1970), Berger (1985), Press (1989), Bernardo and Smith (1994), O'Hagan (1994), Robert (1994), Gelman, Carlin, Stern, and Rubin (1995), Poirier (1995), Schervish (1995), Piccinato (1996)
• Elementary texts: Winkler (1972), O'Hagan (1988), Albert (1996), Berry (1996), Sivia (1996), Antleman (1997), Lee (1997)
• General proceedings volumes: The International Valencia Conferences produce highly acclaimed proceedings, the last of which was edited by Bernardo et al. (1999). The Maximum Entropy and Bayesian Analysis conferences also have excellent proceedings volumes, the last of which was edited by Erickson, Rychert, and Smith (1998). The CMU Bayesian Case Studies Workshops produce unique volumes of in-depth case studies in Bayesian analysis, the last volume being edited by Gatsonis et al. (1998). The Bayesian Statistical Science Section of the ASA has an annual JSM proceedings volume, produced by the ASA.
• AutoClass, a Bayesian classification system (http://ic-www.arc.nasa.gov/ic/projects/bayes-group/group/autoclass/)
• BATS, designed for Bayesian time series analysis (http://www.stat.duke.edu/~mw/bats.html)
• BAYDA, a Bayesian system for classification and discriminant analysis (http://www.cs.Helsinki.fi/research/cosco/Projects/NONE/SW/)
• BAYESPACK, etc., numerical integration algorithms (http://www.math.wsu.edu/math/faculty/genz/homepage)
• Bayesian biopolymer sequencing software (http://www-stat.stanford.edu/~jliu/)
• B/D, a linear subjective Bayesian system (http://fourier.dur.ac.uk:8000/stats/bd/)
• BMA, software for Bayesian model averaging for predictive and other purposes (http://www.research.att.com/~volinsky/bma.html)
• Bayesian regression and classification software based on neural networks, Gaussian processes, and Bayesian mixture models (http://www.cs.utoronto.ca/~radford/fbm.software.html)
• Belief networks software (http://bayes.stat.washington.edu/almond/belief.html)
• BRCAPRO, which implements a Bayesian analysis for genetic counseling of women at high risk for hereditary breast and ovarian cancer (http://www.stat.duke.edu/~gp/brcapro.html)
• BUGS, designed to analyze general hierarchical models via MCMC (http://www.mrc-bsu.cam.ac.uk/bugs/)
• First Bayes, a Bayesian teaching package (http://www.shef.ac.uk/~st1ao/1b.html)
• Matlab and Minitab Bayesian computational algorithms for introductory Bayes and ordinal data (http://www-math.bgsu.edu/~albert/)
• Nuclear magnetic resonance Bayesian software; this is the manual (http://www.bayes.wustl.edu/glb/manual.pdf)
• StatLib, a repository for statistics software, much of it Bayesian (http://lib.stat.cmu.edu/)
• Time series software for nonstationary time series and analysis with autoregressive component models (http://www.stat.duke.edu/~mw/books_software_data.html)
• LISP-STAT, an object-oriented environment for statistical computing and dynamic graphics with various Bayesian capabilities (Tierney 1991)
~~~~~~~~
By James O. Berger
James O. Berger is Arts and Sciences Professor of Statistics, Institute of
Statistics and Decision Sciences, Duke University, Durham, NC 27708 (E-mail:
berger@stat.duke.edu). Preparation was supported by National Science Foundation
grant DMS-9802261. The author is grateful to George Casella, Dalene Stangl, and
Michael Lavine for helpful suggestions.
Title: | Gibbs Sampling. |
Abstract: | Provides information on the Gibbs sampling in statistics. Its definition; Background on Gibbs sampling; Its impact on the statistical community. |
AN: | 3851568 |
ISSN: | 0162-1459 |
Database: | Business Source Premier |
During the course of the 1990s, the technology generally referred to as Markov chain Monte Carlo (MCMC) has revolutionized the way statistical models are fitted and, in the process, dramatically revised the scope of models which can be entertained.
This vignette focuses on the Gibbs sampler. I provide a review of its origins and its crossover into the mainstream statistical literature. I then attempt an assessment of the impact of Gibbs sampling on the research community, on both statisticians and subject area scientists. Finally, I offer some thoughts on where the technology is headed and what needs to be done as we move into the next millennium. The perspective is, obviously, mine, and I apologize in advance for any major omissions. In this regard, my reference list is modest, and again there may be some glaring omissions. I present little technical discussion, as by now detailed presentations are readily available in the literature. The books of Carlin and Louis (2000), Gelman, Carlin, Stern, and Rubin (1995), Robert and Casella (1999), and Tanner (1993) are good places to start. Also, within the world of MCMC, I adopt an informal definition of a Gibbs sampler. Whereas some writers describe "Metropolis steps within Gibbs sampling," others assert that the blockwise updating implicit in a Gibbs sampler is a special case of a "block-at-a-time" Metropolis-Hastings algorithm. For me, the crucial issue is replacement of the sampling of a high-dimensional vector with sampling of lower-dimensional component blocks, thus breaking the so-called curse of dimensionality.
In Section 2 I briefly review what the Gibbs sampler is, how it is implemented, and how it is used to provide inference. With regard to Gibbs sampling, Section 3 asks the question "How did it make its way into the mainstream of statistics?" Section 4 asks "What has been the impact?" Finally, Section 5 asks "Where are we going?" Here, speculation beyond the next decade seems fanciful.
Gibbs sampling is a simulation tool for obtaining samples from a nonnormalized joint density function. Ipso facto, such samples may be "marginalized," providing samples from the marginal distributions associated with the joint density.
2.1 Motivation
The difficulty in obtaining marginal distributions from a nonnormalized joint density lies in integration. Suppose, for example, that theta is a p x 1 vector and f(theta) is a nonnormalized joint density for theta with respect to Lebesgue measure. Normalizing f entails calculating Integral of f(theta) d theta. To marginalize, say for theta[sub i], requires h(theta[sub i]) = Integral of f(theta) d theta[sub (i)] / Integral of f(theta) d theta, where theta[sub (i)] denotes all components of theta save theta[sub i]. Integration is also needed to obtain a marginal expectation or find the distribution of a function of theta. When p is large, such integration is analytically infeasible (the curse of dimensionality). Gibbs sampling offers a Monte Carlo approach.
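To fix ideas before any machinery is introduced: once draws from the joint density are available, marginalization is free, because one simply discards the unwanted coordinates of each joint draw. A small illustrative Python sketch, using an invented two-component model (not from the text) in which theta[sub 1] ~ N(0, 1) and theta[sub 2]|theta[sub 1] ~ N(theta[sub 1], 1), so that the theta[sub 2] marginal is N(0, 2):

```python
import random

# Hypothetical joint model: theta_1 ~ N(0,1), theta_2 | theta_1 ~ N(theta_1, 1).
# The theta_2 marginal is then N(0, 2); we recover its moments by ignoring
# the theta_1 coordinate of each joint draw ("marginalization by sampling").
random.seed(1)
draws = []
for _ in range(50_000):
    t1 = random.gauss(0.0, 1.0)
    t2 = random.gauss(t1, 1.0)
    draws.append((t1, t2))

theta2 = [t2 for _, t2 in draws]            # discard theta_1: marginal sample
mean2 = sum(theta2) / len(theta2)           # should be near 0
var2 = sum((v - mean2) ** 2 for v in theta2) / len(theta2)  # near 2
```

No integration is performed anywhere; the sample moments of the retained coordinate estimate the marginal moments directly.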
The most prominent application has been for inference within a Bayesian framework. Here models are specified as a joint density for the observations, say Y, and the model unknowns, say theta, in the form h(Y|theta)pi(theta). In a Bayesian setting, the observed realizations of Y are viewed as fixed, and inference proceeds from the posterior density of theta, pi(theta|Y) proportional to h(Y|theta)pi(theta) Is equivalent to f(theta), suppressing the fixed Y. So f(theta) is a nonnormalized joint density, and Bayesian inference requires its marginals and expectations, as earlier. If the prior, pi(theta), is set to 1 and if h(Y|theta) is integrable over theta, then the likelihood becomes a nonnormalized density. If marginal likelihoods are of interest, then we have the previous integration problem.
2.2 Monte Carlo Sampling and Integration
Simulation-based approaches for investigating the nonnormalized density f(theta) appeal to the duality between population and sample. In particular, if we can generate arbitrarily many observations from h(theta) = f(theta)/Integral of f(theta) d theta, so-called Monte Carlo sampling, then we can learn about any feature of h(theta) using the corresponding feature of the sample. Noniterative strategies for carrying out such sampling usually involve identification of an importance sampling density, g(theta) (see, e.g., Geweke 1989; West 1992). Given a sample from g(theta), we convert it to a sample from h(theta) by resampling, as done by Rubin (1988) and Smith and Gelfand (1992). If one needs only to compute expectations under h(theta), this can be done directly with samples from g(theta) (see, e.g., Ripley 1987) and is referred to as Monte Carlo integration. Noniterative Monte Carlo methods become infeasible for many high-dimensional models of interest.
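A minimal sketch of this noniterative route under an invented target: f(theta) = theta^2 (1 - theta) on (0, 1), a nonnormalized Beta(3, 2) density with mean 3/5, a uniform importance density g, and resampling in the style of the weighted bootstrap:

```python
import random

random.seed(2)

# Nonnormalized target: f(theta) = theta^2 (1 - theta), i.e. Beta(3, 2)
# up to a constant (its mean is 3/5). Importance density g = Uniform(0, 1).
f = lambda th: th ** 2 * (1.0 - th)

proposals = [random.random() for _ in range(20_000)]
weights = [f(th) for th in proposals]       # importance weights f/g, g = 1

# Weighted bootstrap (sampling-importance-resampling): resample proposals
# with probability proportional to weight to get approximate draws from h.
resampled = random.choices(proposals, weights=weights, k=5_000)
post_mean = sum(resampled) / len(resampled)  # should be near 3/5
```

The resampled points are approximate draws from the normalized h, even though the normalizing constant was never computed.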
Iterative Monte Carlo methods enable us to avoid the curse of dimensionality by sampling low-dimensional subsets of the components of theta. The idea is to create a Markov process whose stationary distribution is h(theta). This seems an unlikely strategy, but, perhaps surprisingly, there are infinitely many ways to do this. Then, suppose that P(theta arrow right A) is the transition kernel of a Markov chain with stationary distribution h(theta). (Here P(theta arrow right A) denotes the probability that theta[sup (t+1)] is an element of A given theta[sup (t)] = theta.) If h[sup (0)] (theta) is a density that provides starting values for the chain, then, with theta[sup (0)] Is similar to h[sup (0)] (theta), using P(theta arrow right A), we can develop a trajectory (sample path) of the chain theta[sup (0)], theta[sup (1)], theta[sup (2)],...,theta[sup (t)],... If t is large enough (i.e., after a sufficiently long "burn-in" period), then theta[sup (t)] is approximately distributed according to h(theta).
A bit more formally, suppose that P(theta arrow right A) admits a transition density, p(eta|theta), with respect to Lebesgue measure. Then pi is an invariant density for p if Integral of pi(theta)p(eta|theta) d theta = pi(eta). In other words, if theta[sup (t)] Is similar to pi, then theta[sup (t+1)] Is similar to pi. Also, GAMMA is a limiting (stationary, equilibrium) distribution for p if lim[sub t arrow right Infinity] P(theta[sup (t)] is an element of A | theta[sup (0)] = theta) = GAMMA(A) (and thus lim[sub t arrow right Infinity] P(theta[sup (t)] is an element of A) = GAMMA(A)). The crucial result is that if p(eta|theta) is aperiodic and irreducible and if pi is a (proper) invariant distribution of p, then pi is the unique invariant distribution; that is, pi is the limiting distribution. A careful theoretical discussion of general MCMC algorithms with references was given by Tierney (1994). Also highly recommended is the set of three Royal Statistical Society papers in 1993 by Besag and Green (1993), Gilks et al. (1993), and Smith and Roberts (1993), together with the ensuing discussion, as well as an article by Besag, Green, Higdon, and Mengersen (1995), again with discussion.
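The invariance property is easiest to see in a finite state space, where the integral becomes a matrix product. In the invented two-state sketch below, pi = (0.6, 0.4) satisfies pi P = pi, and iterating the chain from an arbitrary starting distribution converges to pi, illustrating the limiting-distribution result:

```python
# Tiny finite-state illustration of an invariant (stationary) distribution.
P = [[0.9, 0.1],    # P[i][j] = Pr(next state = j | current state = i)
     [0.15, 0.85]]

# Check invariance: pi P = pi for pi = (0.6, 0.4).
pi = [0.6, 0.4]
pi_next = [pi[0] * P[0][0] + pi[1] * P[1][0],
           pi[0] * P[0][1] + pi[1] * P[1][1]]

# Convergence to pi from an arbitrary starting distribution.
dist = [1.0, 0.0]
for _ in range(200):
    dist = [dist[0] * P[0][0] + dist[1] * P[1][0],
            dist[0] * P[0][1] + dist[1] * P[1][1]]
```

Here the chain is trivially aperiodic and irreducible, so pi is the unique invariant and hence limiting distribution.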
2.3 The Gibbs Sampler
The Gibbs sampler was introduced as an MCMC tool in the context of image restoration by Geman and Geman (1984). Gelfand and Smith (1990) offered the Gibbs sampler as a very general approach for fitting statistical models, extending the applicability of the work of Geman and Geman and also broadening the substitution sampling ideas that Tanner and Wong (1987) proposed under the name of data augmentation.
Suppose that we partition theta into r blocks; that is, theta = (theta[sub 1],...,theta[sub r]). If the current state of theta is theta[sup (t)] = (theta[sup (t), sub 1],...,theta[sup (t), sub r]), then suppose that we make the transition to theta[sup (t+1)] as follows:
draw theta[sup (t+1), sub 1] from h(theta[sub 1]|theta[sup (t), sub 2],...,theta[sup (t), sub r]),
draw theta[sup (t+1), sub 2] from h(theta[sub 2]|theta[sup (t+1), sub 1], theta[sup (t), sub 3],...,theta[sup (t), sub r]),
...,
draw theta[sup (t+1), sub r] from h(theta[sub r]|theta[sup (t+1), sub 1],...,theta[sup (t+1), sub r-1]).
The distributions h(theta[sub i]|theta[sub 1],...,theta[sub i-1], theta[sub i+1],...,theta[sub r]) are referred to as the full, or complete, conditional distributions, and the process of updating each of the r blocks as indicated updates the entire vector theta, producing one complete iteration of the Gibbs sampler. Sampling of theta has been replaced by sampling of lower-dimensional blocks of components of theta.
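A minimal sketch of this cycle for an invented target: a standard bivariate normal with correlation rho, whose full conditionals are the univariate normals theta[sub 1]|theta[sub 2] ~ N(rho theta[sub 2], 1 - rho^2), and symmetrically for theta[sub 2]:

```python
import random

# Gibbs sampler for a standard bivariate normal with correlation rho.
# Each full conditional is univariate normal, so each block update is
# a single random.gauss draw.
random.seed(3)
rho = 0.8
sd = (1.0 - rho ** 2) ** 0.5          # conditional standard deviation

t1, t2 = 5.0, -5.0                    # deliberately poor starting values
samples = []
for t in range(21_000):
    t1 = random.gauss(rho * t2, sd)   # draw theta_1 from its full conditional
    t2 = random.gauss(rho * t1, sd)   # draw theta_2 given the NEW theta_1
    if t >= 1_000:                    # discard a burn-in period
        samples.append((t1, t2))

m1 = sum(a for a, _ in samples) / len(samples)          # near 0
corr_num = sum(a * b for a, b in samples) / len(samples)  # near rho
```

Note the use of the freshly updated theta[sub 1] when drawing theta[sub 2], exactly as in the cycle displayed above.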
2.4 How To Sample the theta[sub i]
Conceptually, the Gibbs sampler emerges as a rather straightforward algorithmic procedure. One aspect of the art of implementation is efficient sampling of the full conditional distributions. Here there are many possibilities. Often, for some of the theta[sub i], the form of the prior specification will be conjugate with the form in the likelihood, so that the full conditional distribution for theta[sub i] will be a "posterior" updating of a standard prior. Note that even if this were the case for every theta[sub i], f(theta) itself need not be a standard distribution; conjugacy may be more useful for Gibbs sampling than for analytical investigation of the entire posterior.
When h(theta[sub i]|theta[sub 1],...,theta[sub i-1], theta[sub i+1],...,theta[sub r]) is nonstandard, we might consider:
• the rejection method, as discussed by Devroye (1986) and Ripley (1987);
• the weighted bootstrap, as discussed by Smith and Gelfand (1992);
• the ratio-of-uniforms method, as described by Wakefield, Gelfand, and Smith (1992);
• approximate cdf inversion when theta[sub i] is univariate, such as the griddy Gibbs sampler, as discussed by Ritter and Tanner (1992);
• adaptive rejection sampling: often the full conditional density for theta[sub i] is log-concave, in which case the usual rejection method may be adaptively improved in a computationally cheap fashion, as described by Gilks and Wild (1992);
• Metropolis-within-Gibbs: the Metropolis (or Metropolis-Hastings) algorithms--which, in principle, enable simultaneous updating of the entire vector theta (Chib and Greenberg 1995; Tierney 1994)--are usually more conveniently used within the Gibbs sampler for updating some of the theta[sub i], typically those with the least tractable full conditional densities.
The important message here is that no single procedure dominates the others for all applications. The form of h(theta) determines which method is most suitable for a given theta[sub i].
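As one concrete instance, the rejection method can be sketched in a few lines for an invented nonstandard full conditional, f(theta) = theta (1 - theta)^4 on (0, 1) (a nonnormalized Beta(2, 5) density, mean 2/7), using a flat envelope:

```python
import random

# Rejection method for a nonnormalized density f on (0, 1).
# Envelope: Uniform(0, 1) proposals with constant M >= max f
# (here max f = f(1/5) = 0.08192); accept with probability f(theta)/M.
random.seed(4)
f = lambda th: th * (1.0 - th) ** 4
M = 0.082

accepted = []
while len(accepted) < 20_000:
    th = random.random()              # propose from the envelope
    if random.random() * M < f(th):   # accept with probability f(th)/M
        accepted.append(th)

mean_est = sum(accepted) / len(accepted)   # Beta(2, 5) mean is 2/7
```

The acceptance rate depends on how tightly M hugs f; the adaptive rejection sampling of Gilks and Wild improves the envelope as sampling proceeds.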
2.5 Convergence
Considerable theoretical work has been done on establishing the convergence of the Gibbs sampler for particular applications, but perhaps the simplest conditions have been given by Smith and Roberts (1993). If f(theta) is lower semicontinuous at 0, if Integral of f(theta) d theta[sub i] is locally bounded for each i, and if the support of f is connected, then the Gibbs sampler algorithm converges. In practice, a range of diagnostic tools is applied to the output of one or more sampled chains. Cowles and Carlin (1996) and Brooks and Roberts (1998) provided comparative reviews of the convergence diagnostics literature. Also, see the related discussions in the hierarchical models vignette by Hobert and the MCMC vignette by Cappe and Robert. (In principle, convergence can never be assessed using such output, as comparison can be made only between different iterations of one chain or between different observed chains, but never with the true stationary distribution.)
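One widely used output-based diagnostic, not detailed in the text, is the Gelman-Rubin potential scale reduction factor, which compares between-chain and within-chain variability across independently started chains. A rough sketch, in which the "chains" are stand-in iid draws so that the statistic should be near 1:

```python
import random
import statistics

# Gelman-Rubin potential scale reduction factor (R-hat) from m chains of
# length n. Values well above 1 suggest the chains have not yet mixed.
random.seed(5)
n, m = 2_000, 4
# Stand-in "chains": iid N(0,1) draws, so R-hat should be close to 1.
chains = [[random.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]

means = [statistics.fmean(c) for c in chains]
W = statistics.fmean(statistics.variance(c) for c in chains)  # within-chain
B = n * statistics.variance(means)                            # between-chain
var_plus = (n - 1) / n * W + B / n    # pooled posterior-variance estimate
r_hat = (var_plus / W) ** 0.5
```

In line with the parenthetical caveat above, R-hat compares chains with one another, not with the true stationary distribution, so a value near 1 is necessary rather than sufficient evidence of convergence.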
2.6 Inference Using the Output of the Gibbs Sampler
The retained output from the Gibbs sampler will be a set of theta[sup *, sub j], j = 1, 2,..., B, assumed to be approximately iid from h = f/Integral of f. If independently started parallel chains are used, then observations from different chains are independent but observations within a given chain are dependent. "Thinning" of the output stream (i.e., taking every kth iteration, perhaps after a burn-in period) yields approximately independent observations within the chain, for k sufficiently large. Evidently, the choice of k hinges on the autocorrelation in the chain. Hence sample autocorrelation functions are often computed to assess the dependence. Given {theta[sup *, sub j]}, for a specified feature of h we compute the corresponding feature of the sample. Because B can be made arbitrarily large, inference using {theta[sup *, sub j]} can be made arbitrarily accurate.
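The sample autocorrelation computation and the resulting choice of thinning lag k can be sketched as follows; the AR(1) "chain" is an invented stand-in whose lag-j autocorrelation is known to be phi^j, so thinning at k = 10 leaves little dependence:

```python
import random

# Sample autocorrelation of chain output, used to choose a thinning lag k.
# Stand-in chain: AR(1) with coefficient phi, lag-j autocorrelation phi**j.
random.seed(6)
phi = 0.7
chain = [0.0]
for _ in range(50_000):
    chain.append(phi * chain[-1] + random.gauss(0.0, 1.0))
chain = chain[1_000:]                 # discard burn-in

def acf(x, lag):
    """Sample autocorrelation of x at the given lag."""
    m = sum(x) / len(x)
    c0 = sum((v - m) ** 2 for v in x)
    return sum((x[i] - m) * (x[i + lag] - m) for i in range(len(x) - lag)) / c0

rho1 = acf(chain, 1)                  # near phi = 0.7
rho10 = acf(chain, 10)                # near phi**10, essentially negligible
thinned = chain[::10]                 # approximately independent draws
```

Since rho10 is already small, every 10th iterate is close to independent, at the cost of discarding 90% of the output.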
The Gibbs sampler was not developed by statisticians. For at least the past half-century, scientists (primarily physicists and applied mathematicians) have sought to simulate the behavior of complex probabilistic models formulated to approximate the behavior of physical, chemical, and biological processes. Such processes were typically characterized by regular lattice structure, and the joint probability distribution of the variables at the lattice points was provided through local specification; that is, the full conditional density h(theta[sub i]|theta[sub j], j = 1, 2,..., r, j is not equal to i) was reduced to h(theta[sub i]|theta[sub j], j is an element of N[sub i]), where N[sub i] is a set of neighbors of location i. But then an obvious question is whether the set of densities h(theta[sub i]|theta[sub j], j is an element of N[sub i]), a so-called Markov random field (MRF) specification, uniquely determines h(theta). Geman and Geman (1984) argued that if each full conditional distribution is a so-called Gibbs density, the answer is yes and, in fact, that this provides an equivalent definition of a MRF. The fact that each theta[sub i] is updated by making a draw from a Gibbs distribution motivated them to refer to the entire updating scheme as Gibbs sampling.
The Gibbs sampler is, arguably, better suited for handling simulation from a posterior distribution. As noted by Gelfand and Smith (1990), h(theta[sub i]|theta[sub j], j is not equal to i) proportional to f(theta), where f(theta) is viewed as a function of theta[sub i] with all other arguments fixed. Hence we always know (at least up to normalization) the full conditional densities needed to implement the Gibbs sampler. The Gibbs sampler can also be used to investigate conditional distributions associated with f(theta), as done by Gelfand and Smith (1991). It is also well suited to the case where f(theta) arises as the restriction of a joint density to a set S (see Gelfand, Smith, and Lee 1992).
The 1990s brought an unimaginable availability of inexpensive high-speed computing, a capability that was blossoming at the time of Gelfand and Smith's 1990 article. This computing power fueled considerable experimentation with the Gibbs sampler, in the process demonstrating its broad practical viability. Concurrently, the increasing computing possibilities were spurring interest in a broad range of complex modeling specifications, including generalized linear mixed models, time series and dynamic models, nonparametric and semiparametric models (particularly for censored survival data), and longitudinal and spatial data models. These could all be straightforwardly fitted as Bayesian models using Gibbs sampling.
Previously, within the statistical community, Bayesians, though confident in the unification and coherence that their paradigm provides, were frustrated by the computational limitations described in Section 2.1, which restricted them to "toy" problems. Though progress was made with numerical integration approaches, analytic approximation methods, and noniterative simulation strategies, fitting the rich classes of hierarchical models that provide the real inferential benefits of the paradigm (e.g., smoothing, borrowing strength, accurate interval estimates) was generally beyond the capability of these tools. The Gibbs sampler provided Bayesians with a tool to fit models previously inaccessible to classical workers. The tables were turned; if one specified a likelihood and prior, the Gibbs sampler was ready to go!
The ensuing fallout has by and large been predictable. Practitioners and subject matter researchers, seeking to explore more realistic models for their data, have enthusiastically embraced the Gibbs sampler, and Bayesians, stimulated by such receptiveness, have eagerly sought collaborative research opportunities. An astonishing proliferation of articles using MCMC model fitting has resulted. On the other hand, classical theoreticians and methodologists, perhaps feeling somewhat threatened, find intellectual vapidity in the entire enterprise; "another Gibbs sampler paper" is a familiar retort.
Though not all statisticians participate, an ideological divide, perhaps stronger than in the past, has emerged. Bayesians will argue that with a full model specification, full inference is available. And the inference is "exact" (although an enormous amount of sampling from the posterior may be required to achieve it!), avoiding the uncertainty associated with asymptotic inference. Frequentists will raise familiar concerns with prior specifications and with inference performance under experimental replication. They also will feel uncomfortable with the black box, nonanalytic nature of the Gibbs sampler. Rather than "random" estimates, they may prefer explicit expressions that permit analytic investigation.
Moreover, Gibbs sampling, as a model-fitting and data-analytic technology, is fraught with the risk of abuse. MCMC methods are frequently stretched to models more complex than the data can hope to support. Inadequate investigation of convergence in high-dimensional settings is often the norm, improper posteriors surface periodically in the literature, and inference is rarely externally checked.
At this point, the Gibbs sampler and MCMC in general are well accepted and utilized for data analysis, and their use in the applied sector will continue to grow. Nonetheless, in the statistical community the frenzy over Gibbs sampling has passed, the field is now relatively stable, and future direction can be assessed. I begin with a list of "tricks of the trade," items still requiring further clarification:
• Model fitting should proceed from simplest to hardest, with fitting of simpler models providing possible starting values, mode searching, and proposal densities for harder models.
• Attention to parameterization is crucial. Given the futility of "transformation to uncorrelatedness," automatic approaches, such as that of Gelfand, Sahu, and Carlin (1995a,b), are needed. Strategies for nonlinear models are even more valuable.
• Latent and auxiliary variables are valuable devices but effective usage requires appreciation of the trade-off between simplified sampling and increased model dimension.
• When can one use the output associated with a component of theta that appears to have converged? For instance, population-level parameters, which are often of primary interest, typically converge more rapidly than individual-level parameters.
• Blocking is recognized as being helpful in handling correlation in the posterior but what are appropriate blocking strategies for hierarchical models?
• Often "hard to fit" parameters are fixed to improve the convergence behavior of a Gibbs sampler. Is an associated sensitivity analysis adequate in such cases?
• Because harder models are usually weakly identified, informative priors are typically required to obtain well behaved Gibbs samplers. How does one use the data to develop these priors and to specify them as weakly as possible?
• Good starting values are required to run multiple chains. How does one obtain "overdispersed" starting values?
With the broad range of models that can now be explored using Gibbs sampling, one naturally must address questions of model determination. Strategies that conveniently piggyback onto the output of Gibbs samplers are of particular interest as ad hoc screening and checking procedures. In this regard, see Gelfand and Ghosh (1998) and Spiegelhalter, Best, and Carlin (1998) for model choice approaches and Gelman, Meng, and Stern (1996) for model adequacy ideas.
Finally, with regard to software development, the BUGS package (Spiegelhalter, Thomas, Best, and Gilks 1995) is, at this point, general and reliable enough (with no current competition) to be used both for research and teaching. CODA (Best, Cowles, and Vines 1995) is a convenient add-on that implements a medley of convergence diagnostics. The future will likely bring specialized packages to accommodate specific classes of models, such as time series and dynamic models. However, fitting cutting-edge models will always require tinkering and tuning (and possibly specialized algorithms), placing it beyond extant software. But such software can often fit the simpler models one explores before the harder ones and can be used to check individual code.
As for hardware, it is a given that increasingly faster machines with larger and larger storage will evolve, making feasible the execution of enormous numbers of iterations for high-dimensional models within realistic run times, diminishing convergence concerns. However, one also would expect that more capable multiprocessor machines will be challenged by bigger datasets and more complex models, encouraging parallel processing MCMC implementations.
Besag, J., and Green, P. J. (1993), "Spatial Statistics and Bayesian Computation," Journal of the Royal Statistical Society, Ser. B, 55, 25-37.
Besag, J., Green, P., Higdon, D., and Mengersen, K. (1995), "Bayesian Computation and Stochastic Systems" (with discussion), Statistical Science, 10, 3-66.
Best, N. G., Cowles, M. K., and Vines, K. (1995), "CODA: Convergence Diagnostics and Output Analysis Software for Gibbs Sampling Output, Version 0.30," Medical Research Council, Biostatistics Unit, Cambridge, U.K.
Brooks, S. P., and Roberts, G. O. (1998), "Assessing Convergence of Markov Chain Monte Carlo Algorithms," Statistics and Computing, 8, 319-335.
Carlin, B. P., and Louis, T. A. (2000), Bayes and Empirical Bayes Methods for Data Analysis (2nd ed.), London: Chapman and Hall.
Chib, S., and Greenberg, E. (1995), "Understanding the Metropolis-Hastings Algorithm," American Statistician, 49, 327-335.
Cowles, M. K., and Carlin, B. P. (1996), "Markov Chain Monte Carlo Convergence Diagnostics: A Comparative Review," Journal of the American Statistical Association, 91, 883-904.
Devroye, L. (1986), Non-Uniform Random Variate Generation, New York: Springer-Verlag.
Gelfand, A. E., and Ghosh, S. K. (1998), "Model Choice: A Minimum Posterior Predictive Loss Approach," Biometrika, 85, 1-11.
Gelfand, A. E., Sahu, S. K., and Carlin, B. P. (1995a), "Efficient Parameterization for Normal Linear Mixed Effects Models," Biometrika, 82, 479-488.
----- (1995b), "Efficient Parameterization for Generalized Linear Mixed Models," in Bayesian Statistics 5, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, London: Oxford University Press, pp. 47-74.
Gelfand, A. E., and Smith, A. F. M. (1990), "Sampling-Based Approaches to Calculating Marginal Densities," Journal of the American Statistical Association, 85, 398-409.
Gelfand, A. E., Smith, A. F. M., and Lee, T-M. (1992), "Bayesian Analysis of Constrained Parameter and Truncated Data Problems," Journal of the American Statistical Association, 87, 523-532.
Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (1995), Bayesian Data Analysis, London: Chapman and Hall.
Gelman, A., Meng, X-L., and Stern, H. S. (1996), "Posterior Predictive Assessment of Model Fitness via Realized Discrepancies" (with discussion), Statistica Sinica, 6, 733-807.
Geman, S., and Geman, D. (1984), "Stochastic Relaxation, Gibbs Distributions and the Bayesian Restoration of Images," IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721-741.
Geweke, J. F. (1989), "Bayesian Inference in Econometric Models Using Monte Carlo Integration," Econometrica, 57, 1317-1339.
Gilks, W. R., Clayton, D. G., Spiegelhalter, D. J., Best, N. E., McNeil, A. J., Sharples, L. D., and Kirby, A. J. (1993), "Modeling Complexity: Applications of Gibbs Sampling in Medicine," Journal of the Royal Statistical Society, Ser. B, 55, 39-52.
Gilks, W. R., and Wild, P. (1992), "Adaptive Rejection Sampling for Gibbs Sampling," Journal of the Royal Statistical Society, Ser. C, 41, 337-348.
Ripley, B. D. (1987), Stochastic Simulation, New York: Wiley.
Ritter, C., and Tanner, M. A. (1992), "The Gibbs Stopper and the Griddy Gibbs Sampler," Journal of the American Statistical Association, 87, 861-868.
Robert, C. P., and Casella, G. (1999), Monte Carlo Statistical Methods, New York: Springer-Verlag.
Rubin, D. B. (1988), "Using the SIR Algorithm to Simulate Posterior Distributions," in Bayesian Statistics 3, eds. J. M. Bernardo, M. H. DeGroot, D. V. Lindley, and A. F. M. Smith, London: Oxford University Press, pp. 395-402.
Smith, A. F. M., and Gelfand, A. E. (1992), "Bayesian Statistics Without Tears," American Statistician, 46, 84-88.
Smith, A. F. M., and Roberts, G. O. (1993), "Bayesian Computation via the Gibbs Sampler and Related Markov Chain Monte Carlo Methods," Journal of the Royal Statistical Society, Ser. B, 55, 3-23.
Spiegelhalter, D., Best, N., and Carlin, B. P. (1998), "Bayesian Deviance, the Effective Number of Parameters and the Comparison of Arbitrarily Complex Models," technical report, MRC Biostatistics Unit, Cambridge, U.K.
Spiegelhalter, D. J., Thomas, A., Best, N., and Gilks, W. R. (1995), "BUGS: Bayesian Inference Using Gibbs Sampling, Version 0.50," Medical Research Council, Biostatistics Unit, Cambridge, U.K.
Tanner, M. A. (1993), Tools for Statistical Inference (2nd ed.), New York: Springer-Verlag.
Tanner, M. A., and Wong, W. H. (1987), "The Calculation of Posterior Distributions by Data Augmentation," Journal of the American Statistical Association, 82, 528-540.
Tierney, L. (1994), "Markov Chains for Exploring Posterior Distributions," The Annals of Statistics, 22, 1701-1762.
Wakefield, J., Gelfand, A. E., and Smith, A. F. M. (1992), "Efficient Computation of Random Variates via the Ratio-of-Uniforms Method," Statistics and Computing, 1, 129-133.
West, M. (1992), "Modeling With Mixtures," in Bayesian Statistics 4, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, London: Oxford University Press, pp. 503-524.
~~~~~~~~
By Alan E. Gelfand
Alan E. Gelfand is Professor, Department of Statistics, University of
Connecticut, Storrs, CT 06269. (E-mail: alan@stat.uconn.edu). His work was
supported in part by National Science Foundation grant DMS 96-25383.
Title: | Hypothesis Testing: From p Values to Bayes Factors. |
Abstract: | Deals with hypothesis testing from p values to Bayes factors. Definition of hypothesis testing; Description on the classical hypothesis testing; Information on the Bayesian hypothesis testing. |
AN: | 3851579 |
ISSN: | 0162-1459 |
Database: | Business Source Premier |
Testing hypotheses involves deciding on the plausibility of two or more hypothetical statistical models based on some data. It may be that the two hypotheses are on equal footing; for example, the goal of Mosteller and Wallace (1984) was to decide which of Alexander Hamilton or James Madison wrote a number of the Federalist Papers. It is more common that there is a particular null hypothesis that is a simplification of a larger model, such as when testing whether a population mean is 0 versus the alternative that it is not 0. This null hypothesis may be something that one actually believes could be (approximately) true, such as the specification of an astronomical constant; or something that one believes is false but is using as a straw man, such as the claim that a new drug has no efficacy; or something that one hopes is true for convenience's sake, such as equal variances in the analysis of variance.
Formally, we assume that the data are represented by x, which has density f[sub theta](x) for parameter theta. We test the null versus the alternative hypotheses,
H[sub 0]: theta is an element of THETA[sub 0] versus H[sub A]: theta is an element of THETA[sub A],
where THETA[sub 0] and THETA[sub A] are disjoint subsets of the parameter space. A more general view allows comparison of several models.
In 1908, W. S. Gossett (Student 1908) set the stage for twentieth century "classical" hypothesis testing. He derived the distribution of the Student t statistic and used it to test null hypotheses about one mean and the difference of two means. His approach is used widely today. First, choose a test statistic. Next, calculate its p value, which is the probability, assuming that the null hypothesis is true, that one would obtain a value of the test statistic as or more extreme than that obtained from the data. Finally, interpret this p value as the probability that the null hypothesis is true. Of course, the p value is not the probability that the null hypothesis is true, but small values do cause one to doubt the null hypothesis. R. A. Fisher (1925) promoted the p value for testing in a wide variety of problems, rejecting a null hypothesis when the p value is too small: "We shall not often be astray if we draw a conventional line at .05."
Jerzy Neyman and Egon Pearson (1928, 1933) shifted focus from the p value to the fixed-level test. The null hypothesis is rejected if the test statistic exceeds a constant, where the constant is chosen so that the probability of rejecting the null hypothesis when the null hypothesis is true equals (or is less than or equal to) a prespecified level. Their work had major impact on the theory and methods of hypothesis testing.
Test Statistics Became Evaluated on Their Risk Functions. Neyman and Pearson set out to find "efficient" tests. Their famous and influential result, the Neyman-Pearson lemma, solved the problem when both hypotheses are simple; that is, THETA[sub 0] and THETA[sub A] each contain exactly one point. Letting 0 and A be those points, the likelihood ratio test rejects the null hypothesis when f[sub A](x)/f[sub 0](x) exceeds a constant. This test maximizes the power (the probability of rejecting the null when the alternative is true) among tests with its level. For situations in which one or both of the hypotheses are composite (not simple), there is typically no uniformly best test, whence different tests are compared based on their risk functions. The risk function is a function of the parameters in the model, being the probability of making an error: rejecting the null when the parameter is in the null or accepting the null when the parameter is in the alternative.
The Maximum Likelihood Ratio Test Became a Key Method. When a hypothesis is composite, Neyman and Pearson replace the density at a single parameter value with the maximum of the density over all parameters in that hypothesis. The maximum likelihood ratio test rejects the null hypothesis for large values of sup[sub theta is an element of THETA[sub A]]f[sub theta](x)/sup[sub theta is an element of THETA[sub 0]]f[sub theta](x). This is an extremely useful method for finding good testing procedures. It can be almost universally applied; under fairly broad conditions, it has a nice asymptotic chi-squared distribution under the null hypothesis, and usually (but not always) has reasonable operating characteristics.
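As a minimal sketch of this statistic (not part of the article; the binomial example and counts are invented for illustration), the maximum likelihood ratio for H[sub 0]: p = .5 against an unrestricted alternative with k successes in n trials can be computed directly and compared to the chi-squared reference distribution:

```python
import math

def max_lik_ratio_test(k, n, p0=0.5):
    """-2 log Lambda for H0: p = p0 versus an unrestricted alternative,
    with k successes in n binomial trials; asymptotically chi-squared, 1 df."""
    p_hat = k / n  # MLE over the full parameter space

    def loglik(p):
        # log-likelihood up to the binomial coefficient, which cancels in the ratio
        if p == 0.0:
            return 0.0 if k == 0 else float("-inf")
        if p == 1.0:
            return 0.0 if k == n else float("-inf")
        return k * math.log(p) + (n - k) * math.log(1 - p)

    return 2 * (loglik(p_hat) - loglik(p0))

stat = max_lik_ratio_test(60, 100)  # compare to the chi-squared(1) .95 point, 3.84
```

With 60 successes in 100 trials the statistic exceeds 3.84, so the null p = .5 is rejected at the conventional .05 level.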
2.1 Theory
Abraham Wald (1950) codified the decision-theoretic approach to all of statistics that Neyman and Pearson took to hypothesis testing. Erich Lehmann (1959) was singularly influential, setting up the framework for evaluating statistical procedures based on their risk functions. A battery of criteria based on the risk function was used in judging testing procedures, including uniformly most powerful, locally most powerful, asymptotically most powerful, unbiasedness, similarity, admissibility or inadmissibility, Bayes, minimaxity, stringency, invariance, consistency, and asymptotic efficiencies (Bahadur, Chernoff, Hodges-Lehmann, Rubin-Sethuraman--see Serfling 1980).
Authors who have applied these criteria to various problems include Birnbaum (1955), Brown, Cohen, and Strawderman (1976, 1980), Brown and Marden (1989), Cohen and Sackrowitz (1987), Eaton (1970), Farrell (1968), Kiefer and Schwartz (1965), Marden (1982), Matthes and Truax (1967), Perlman (1980), Perlman and Olkin (1980), and Stein (1956).
A topic of substantial current theoretical interest is the use of algebraic structures in statistics (see Diaconis 1988). Testing problems often exhibit certain symmetries that can be expressed in terms of groups. Such problems can be reduced to so-called maximal invariant parameters and maximal invariant statistics, which are typically of much lower dimension than the original quantities. Stein's method (Andersson 1982; Wijsman 1967) integrates the density over the group using special measures to obtain the density of the maximal invariant statistic, which can then be used to analyze the problems; for example, to find the uniformly most powerful invariant test.
The real payoff comes in being able to define entire classes of testing problems, unifying many known models and suggesting useful new ones. For example, Andersson (1975) used group symmetry to define a large class of models for multivariate normal covariance matrices, including independence of blocks of variates, intraclass correlation, equality of covariance matrices, and complex and quaternion structure. Other models based on graphs or lattices can express complicated independence and conditional independence relationships among continuous or categorical variables (see, e.g., Andersson and Perlman 1993; Lauritzen 1996; Wermuth and Cox 1992; Whittaker 1990). Such "meta-models" not only allow expression of complicated models, but also typically provide unified systematic analysis, including organized processes for implementing likelihood procedures.
2.2 Methods
As far as general methods for hypothesis testing, the granddaddy of them all has to be the chi-squared test of Karl Pearson (1900). Its extensions permeate practice, especially in categorical models. But the most popular general test is the maximum likelihood ratio test. Related techniques were developed by Rao (1947) and Wald (1943). Cox (1961) explored these methods for nonnested models; that is, models for which neither hypothesis is contained within the closure of the other. The union-intersection principle of Roy (1953) provides another useful testing method.
The general techniques have been applied in so many innovative ways that it is hard to know where to begin. Classic works that present important methods and numerous references include books by Anderson (1984) on normal multivariate analysis (notable references include Bartlett 1937; Hotelling 1931; Mahalanobis 1930); Bishop, Fienberg, and Holland (1975) on discrete multivariate analysis (see also Goodman 1968; Haberman 1974); Barlow, Bartholomew, Bremner, and Brunk (1972) and Robertson, Wright and Dykstra (1988) on order-restricted inference; and Rao (1973) on everything.
Fisher (1935) planted the seeds for another crop of methods with his randomization tests, in which the p value is calculated by rerandomizing the actual data. These procedures yield valid p values under much broader distributional assumptions than are usually assumed. The field of nonparametrics grew out of such ideas, the goal being to have testing procedures that work well even if the assumptions do not hold exactly (see Gibbons and Chakraborti 1992; Hettmansperger and McKean 1998; and the robust nonparametric methods vignette by Hettmansperger, McKean, and Sheather for overviews). Popular early techniques include the Mann-Whitney/Wilcoxon two-sample test, the Kruskal-Wallis one-way analysis-of-variance test, the Kendall tau and Spearman rho tests for association, and the Kolmogorov-Smirnov tests on distribution functions. Later work extended the techniques. A few of the many important authors include Akritas and Arnold (1994), Patel and Hoel (1973), and Puri and Sen (1985), for linear models and analysis of variance, and Chakraborty and Chaudhuri (1999), Chaudhuri and Sengupta (1993), Choi and Marden (1997), Friedman and Rafsky (1979), Hettmansperger, Mottonen, and Oja (1997), Hettmansperger, Nyblom, and Oja (1994), Liu and Singh (1993), Puri and Sen (1971), and Randles (1989) for various approaches to multivariate one-, two-, and many-sample techniques.
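A randomization test of the kind Fisher introduced can be sketched in a few lines (an illustration, not from the article; the two small samples are invented): rerandomize the group labels over all possible assignments and count how often the difference in means is at least as extreme as the one observed.

```python
import itertools

def randomization_p_value(group_a, group_b):
    """Exact two-sample randomization test in the spirit of Fisher (1935).

    Enumerates every reassignment of the pooled values to the two groups and
    counts how often the absolute difference in means is at least as extreme
    as the one actually observed."""
    pooled = list(group_a) + list(group_b)
    n_a, total = len(group_a), sum(pooled)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))
    count = hits = 0
    for combo in itertools.combinations(pooled, n_a):
        count += 1
        mean_a = sum(combo) / n_a
        mean_b = (total - sum(combo)) / (len(pooled) - n_a)
        if abs(mean_a - mean_b) >= observed - 1e-12:  # float tolerance
            hits += 1
    return hits / count

p = randomization_p_value([12.1, 11.8, 12.5], [10.2, 10.9, 10.4])
```

The only distributional assumption is exchangeability of the observations under the null, which is what makes such procedures valid so broadly.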
A vibrant area of research today is extending the scope of nonparametric/robust testing procedures to more complicated models, such as general multivariate linear models, covariance structures, and independence and conditional independence models (see Marden 1999 for a current snapshot). More general problems benefit from the bootstrap and related methods. Davison and Hinkley (1997) provided an introduction, and Beran and Millar (1987) and Liu and Singh (1997) presented additional innovative uses of resampling.
Not everyone has been happy with the classical formulation. In particular, problems arise when trying to use the level or the p value to infer something about the truth or believability of the null. Some common complaints are as follows:
1. The p value is not the probability that the null hypothesis is true. Gossett, and many practitioners and students since, have tried to use the p value as the probability that the null hypothesis is true. Not only is this wrong, but it can be far from reasonable. Lindley (1957) laid out a revealing paradox in which the p value is fixed at .05, but as the sample size increases, the Bayesian posterior probability that the null hypothesis is true approaches 1. Subsequent work (Berger and Sellke 1987; Edwards, Lindman, and Savage 1963) showed that it is a general phenomenon that the p value does not give a reasonable assessment of the probability of the null. In one-sided problems, where the null is mu less than or equal to 0 and the alternative is mu > 0, the p value can be a reasonable approximation of the probability of the null. Because this is the situation that obtains in Student 1908, we can let Gossett off the hook (see Casella and Berger 1987; Pratt 1965). Good (1987), and references therein, discussed other comparisons of p values and Bayes.
2. The p value is not very useful when sample sizes are large. Almost no null hypothesis is exactly true. Consequently, when sample sizes are large enough, almost any null hypothesis will have a tiny p value, and hence will be rejected at conventional levels.
3. Model selection is difficult. Instead of having just two hypotheses to choose from, one may have several models under consideration, such as in a regression model when there is a collection of potential explanatory variables. People often use classical hypothesis testing to perform stepwise algorithms. Not only are the resulting significance levels suspect, but the methods for comparing non-nested models are not widely used.
The Bayesian approach to hypothesis testing answers these complaints, at the cost of requiring a prior distribution on the parameter. It is helpful to break the prior into three pieces: the density gamma[sub A] of the parameter conditional on it being in the alternative, the density gamma[sub 0] of the parameter conditional on it being in the null, and the probability pi[sub 0] that the null hypothesis is true. The prior odds in favor of the alternative are (1 - pi[sub 0])/pi[sub 0]. Then the posterior odds in favor of the alternative are a product of the prior odds and the Bayes factor; that is, the integrated likelihood ratio, Integral[sub THETA[sub A]] f[sub theta](x)gamma[sub A](theta)d theta/Integral[sub THETA[sub 0]] f[sub theta](x)gamma[sub 0](theta)d theta. Thus the posterior probability that the null hypothesis is true can be legitimately calculated as 1/(1 + posterior odds). These calculations can easily be extended to model selection.
It is obvious the posterior odds depend heavily on pi[sub 0]. What is not so obvious is that the posterior odds also depend heavily on the individual priors gamma[sub A] and gamma[sub 0]. For example, as the prior gamma[sub A] becomes increasingly flat, the posterior odds usually approach 0. Jeffreys (1939) recommended "reference" priors gamma[sub A] and gamma[sub 0] in many common problems. These are carefully chosen so that the priors do not overwhelm the data. Other approaches include developing priors using imaginary prior data, and using part of the data as a training sample. Berger and Pericchi (1996) found that averaging over training samples can approximate so-called "intrinsic" priors. Kass and Wasserman (1996) reviewed a number of methods for finding priors. Choosing pi[sub 0] for the prior odds is more problematic. Jeffreys's suggestion is to take pi[sub 0] = 1/2, which means that the posterior odds equals the Bayes factor.
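These formulas can be made concrete in the simplest conjugate case (a numerical sketch, not from the article; the normal example, sigma, tau, and sample sizes are all invented for illustration). For a point null on a normal mean with known sigma and a N(0, tau[sup 2]) prior under the alternative, both marginal densities of the sample mean are normal, so the Bayes factor is a ratio of two normal densities; fixing the p value at .05 while n grows then reproduces Lindley's paradox numerically.

```python
import math

def normal_pdf(x, var):
    """Density of a N(0, var) distribution evaluated at x."""
    return math.exp(-x * x / (2 * var)) / math.sqrt(2 * math.pi * var)

def posterior_prob_null(xbar, n, sigma=1.0, tau=1.0, pi0=0.5):
    """P(H0 | data) for H0: mu = 0 versus H_A: mu ~ N(0, tau^2),
    where xbar is the mean of n observations with known sigma."""
    # Bayes factor for the alternative: ratio of the marginal densities of xbar
    bf_alt = normal_pdf(xbar, sigma**2 / n + tau**2) / normal_pdf(xbar, sigma**2 / n)
    posterior_odds = (1 - pi0) / pi0 * bf_alt
    return 1 / (1 + posterior_odds)

# Fix the p value at .05 (xbar = 1.96 * sigma / sqrt(n)) and let n grow:
# the posterior probability of the null climbs toward 1 (Lindley's paradox).
small_n = posterior_prob_null(1.96 / math.sqrt(10), 10)
large_n = posterior_prob_null(1.96 / math.sqrt(10000), 10000)
```

With Jeffreys's choice pi[sub 0] = 1/2, the posterior odds equal the Bayes factor, so the same data that a fixed-level test rejects at .05 can leave the null more probable than not once n is large.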
There is growing evidence that this Bayes approach is very useful in practice, and not just a cudgel for bashing frequentists (see Kass and Raftery 1995 and Raftery 1995 for some interesting applications and references).
We should not discard the p value altogether, but just be careful. A small p value does not necessarily mean that we should reject the null hypothesis and leave it at that. Rather, it is a red flag indicating that something is up; the null hypothesis may be false, possibly in a substantively uninteresting way, or maybe we got unlucky. On the other hand, a large p value does mean that there is not much evidence against the null. For example, in many settings the Bayes factor is bounded above by 1/p value. Even Bayesian analyses can benefit from classical testing at the model-checking stage (see Box 1980 or the Bayesian p values in Gelman, Carlin, Stern, and Rubin 1995).
As we move into the next millennium, it is important to expand the scope of hypothesis testing, as statistics will be increasingly asked to deal with huge datasets and extensive collections of complex models with large numbers of parameters. (Data mining, anyone?) The Bayesian approach appears to be very promising for such situations, especially if fairly automatic methods for calculating the Bayes factors are developed.
There will continue to be plenty of challenges in the classical arena. In addition to developing systematic collections of models and broadening the set of useful nonparametric tools, practical methods for finding p values for projection-pursuit (e.g., Sun 1991) and other methods that involve searching over large spaces will be crucial.
Akritas, M. G., and Arnold, S. F. (1994), "Fully Nonparametric Hypotheses for Factorial Designs, I: Multivariate Repeated Measures Designs," Journal of the American Statistical Association, 89, 336-343.
Anderson, T. W. (1984), An Introduction to Multivariate Statistical Analysis (2nd ed.), New York: Wiley.
Andersson, S. (1975), "Invariant Normal Models," The Annals of Statistics, 3, 132-154.
----- (1982), "Distributions of Maximal Invariants Using Quotient Measures," The Annals of Statistics, 10, 955-961.
Andersson, S. A., and Perlman, M. D. (1993), "Lattice Models for Conditional Independence in a Multivariate Normal Distribution," Annals of Statistics, 21, 1318-1358.
Barlow, R. E., Bartholomew, D. J., Bremner, J. M., and Brunk, H. D. (1972), Statistical Inference Under Order Restrictions, New York: Wiley.
Bartlett, M. S. (1937), "Properties of Sufficiency and Statistical Tests," Proceedings of the Royal Society of London, Ser. A, 160, 268-282.
Beran, R., and Millar, P. W. (1987), "Stochastic Estimation and Testing," The Annals of Statistics, 15, 1131-1154.
Berger, J. O., and Pericchi, L. R. (1996), "The Intrinsic Bayes Factor for Model Selection and Prediction," Journal of the American Statistical Association, 91, 109-122.
Berger, J. O., and Sellke, T. (1987), "Testing a Point Null Hypothesis: The Irreconcilability of p Values and Evidence" (with discussion), Journal of the American Statistical Association, 82, 112-122.
Birnbaum, A. (1955), "Characterizations of Complete Classes of Tests of Some Multiparametric Hypotheses, With Applications to Likelihood Ratio Tests," Annals of Mathematical Statistics, 26, 21-36.
Bishop, Y. M. M., Fienberg, S. E., and Holland, P. W. (1975), Discrete Multivariate Analysis: Theory and Practice. Cambridge, MA: MIT Press.
Box, G. E. P. (1980), "Sampling and Bayes' Inference in Scientific Modelling and Robustness" (with discussion), Journal of the Royal Statistical Society, Ser. A, 143, 383-430.
Brown, L. D., Cohen, A., and Strawderman, W. E. (1976), "A Complete Class Theorem for Strict Monotone Likelihood Ratio With Applications," The Annals of Statistics, 4, 712-722.
----- (1980), "Complete Classes for Sequential Tests of Hypotheses," The Annals of Statistics, 8, 377-398, Corr. 17, 1414-1416.
Brown, L. D., and Marden, J. I. (1989), "Complete Class Results for Hypothesis Testing Problems With Simple Null Hypotheses," The Annals of Statistics, 17, 209-235.
Casella, G., and Berger, R. L. (1987), "Reconciling Bayesian and Frequentist Evidence in the One-Sided Testing Problem," Journal of the American Statistical Association, 82, 106-111.
Chakraborty, B., and Chaudhuri, P. (1999), "On Affine Invariant Sign and Rank Tests in One- and Two-Sample Multivariate Problems," in Multivariate Analysis, Design of Experiments, and Survey Sampling, ed. S. Ghosh, New York, Marcel Dekker, pp. 499-522.
Chaudhuri, P., and Sengupta, D. (1993), "Sign Tests in Multidimension: Inference Based on the Geometry of the Point Cloud," Journal of the American Statistical Association, 88, 1363-1370.
Choi, K. M., and Marden, J. I. (1997), "An Approach to Multivariate Rank Tests in Multivariate Analysis of Variance," Journal of the American Statistical Association, 92, 1581-1590.
Cohen, A., and Sackrowitz, H. B. (1987), "Unbiasedness of Tests for Homogeneity," The Annals of Statistics, 15, 805-816.
Cox, D. R. (1961), "Tests of Separate Families of Hypotheses," Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1, 105-123.
Davison, A. C., and Hinkley, D. V. (1997), Bootstrap Methods and Their Application, Cambridge, U.K.: Cambridge University Press.
Diaconis, P. (1988), Group Representations in Probability and Statistics, Heyward, CA: Institute for Mathematical Statistics.
Eaton, M. L. (1970), "A Complete Class Theorem for Multidimensional One-Sided Alternatives," Annals of Mathematical Statistics, 41, 1884-1888.
Edwards, W., Lindman, H., and Savage, L. J. (1963), "Bayesian Statistical Inference for Psychological Research," Psychological Review, 70, 193-242.
Farrell, R. H. (1968), "Towards a Theory of Generalized Bayes Tests," Annals of Mathematical Statistics, 39, 1-22.
Fisher, R. A. (1925), Statistical Methods for Research Workers, London: Oliver and Boyd.
----- (1935), The Design of Experiments, London: Oliver and Boyd.
Friedman, J. H., and Rafsky, L. C. (1979), "Multivariate Generalizations of the Wald-Wolfowitz and Smirnov Two-Sample Tests," The Annals of Statistics, 7, 697-717.
Gelman, A., Carlin, J., Stern, H., and Rubin, D. (1995), Bayesian Data Analysis, London: Chapman and Hall.
Gibbons, J. D., and Chakraborti, S. (1992), Nonparametric Statistical Inference, New York: Marcel Dekker.
Good, I. J. (1987), Comments on "Testing a Point Null Hypothesis: The Irreconcilability of p Values and Evidence" by Berger and Sellke and "Reconciling Bayesian and Frequentist Evidence in the One-Sided Testing Problem" by Casella and Berger, Journal of the American Statistical Association, 82, 125-128.
Goodman, L. A. (1968), "The Analysis of Cross-Classified Data: Independence, Quasi-Independence, and Interactions in Contingency Tables With or Without Missing Entries," Journal of the American Statistical Association, 63, 1091-1131.
Haberman, S. J. (1974), The Analysis of Frequency Data, Chicago: University of Chicago Press.
Hettmansperger, T. P., and McKean, J. W. (1998), Robust Nonparametric Statistical Methods, London: Arnold.
Hettmansperger, T. P., Mottonen, J., and Oja, H. (1997), "Affine-Invariant Multivariate One-Sample Signed-Rank Tests," Journal of the American Statistical Association, 92, 1591-1600.
Hettmansperger, T. P., Nyblom, J., and Oja, H. (1994), "Affine Invariant Multivariate One-Sample Sign Tests," Journal of the Royal Statistical Society, Ser. B, 56, 221-234.
Hotelling, H. (1931), "The Generalization of Student's Ratio," Annals of Mathematical Statistics, 2, 360-378.
Jeffreys, H. (1939), Theory of Probability, Oxford, U.K.: Clarendon Press.
Kass, R. E., and Raftery, A. E. (1995), "Bayes Factors," Journal of the American Statistical Association, 90, 773-795.
Kass, R. E., and Wasserman, L. (1996), "The Selection of Prior Distributions by Formal Rules," Journal of the American Statistical Association, 91, 1343-1370.
Kiefer, J., and Schwartz, R. (1965), "Admissible Bayes Character of T[sup 2]-, R[sup 2]-, and Other Fully Invariant Tests for Classical Multivariate Normal Problems," Annals of Mathematical Statistics, 36, 747-770. Corr. 43, 1742.
Lauritzen, S. L. (1996), Graphical Models, Oxford, U.K.: Clarendon Press.
Lehmann, E. L. (1959), Testing Statistical Hypotheses. New York: Wiley.
Lindley, D. V. (1957), "A Statistical Paradox," Biometrika, 44, 187-192.
Liu, R. Y., and Singh, K. (1993), "A Quality Index Based on Data Depth and Multivariate Rank Tests," Journal of the American Statistical Association, 88, 252-260.
----- (1997), "Notions of Limiting p Values Based on Data Depth and Bootstrap," Journal of the American Statistical Association, 92, 266-277.
Mahalanobis, P. C. (1930), "On Tests and Measures of Group Divergence," Journal and Proceedings of the Asiatic Society of Bengal, 26, 541-588.
Marden, J. I. (1982), "Minimal Complete Classes of Tests of Hypotheses With Multivariate One-Sided Alternatives," The Annals of Statistics, 10, 962-970.
----- (1999), "Multivariate Rank Tests," in Multivariate Analysis, Design of Experiments, and Survey Sampling, ed. S. Ghosh, New York: Marcel Dekker, pp. 401-432.
Matthes, T. K., and Truax, D. R. (1967), "Test of Composite Hypotheses for the Multivariate Exponential Family," Annals of Mathematical Statistics, 38, 681-697. Corr. 38, 1928.
Mosteller, F., and Wallace, D. L. (1984), Applied Bayesian and Classical Inference: The Case of the Federalist Papers, New York: Springer.
Neyman, J., and Pearson, E. (1928), "On the Use and Interpretation of Certain Test Criteria for Purposes of Statistical Inference. Part I," Biometrika, 20A, 175-240.
----- (1933), "On the Problem of the Most Efficient Tests of Statistical Hypotheses," Philosophical Transactions of the Royal Society, Ser. A, 231, 289-337.
Patel, K. M., and Hoel, D. G. (1973), "A Nonparametric Test for Interaction in Factorial Experiments," Journal of the American Statistical Association, 68, 615-620.
Pearson, K. (1900), "On the Criterion That a Given System of Deviations From the Probable in the Case of a Correlated System of Variables Is Such That It Can Be Reasonably Supposed to Have Arisen From Random Sampling," Philosophical Magazine, Ser. 5, 50, 157-175.
Perlman, M. D. (1980), "Unbiasedness of Multivariate Tests: Recent Results," in Multivariate Analysis V, Amsterdam: North-Holland/Elsevier, pp. 413-432.
Perlman, M. D., and Olkin, I. (1980), "Unbiasedness of Invariant Tests for MANOVA and Other Multivariate Problems," The Annals of Statistics, 8, 1326-1341.
Pratt, J. W. (1965), "Bayesian Interpretation of Standard Inference Statements," Journal of the Royal Statistical Society, Ser. B, 27, 169-203.
Puri, M. L., and Sen, P. K. (1971), Nonparametric Methods in Multivariate Analysis, New York: Wiley.
----- (1985), Nonparametric Methods in General Linear Models, New York: Wiley.
Raftery, A. E. (1995), "Bayesian Model Selection in Social Research," in Sociological Methodology 1995, ed. P. V. Marsden, Oxford, U.K.: Blackwells, pp. 111-196.
Randles, R. H. (1989), "A Distribution-Free Multivariate Sign Test Based on Interdirections," Journal of the American Statistical Association, 84, 1045-1050.
Rao, C. R. (1947), "Large-Sample Tests of Statistical Hypotheses Concerning Several Parameters With Applications to Problems of Estimation," Proceedings of the Cambridge Philosophical Society, 44, 50-57.
----- (1973), Linear Statistical Inference and its Applications, New York: Wiley.
Robertson, T., Wright, F. T., and Dykstra, R. L. (1988), Order-Restricted Statistical Inference, New York: Wiley.
Roy, S. N. (1953), "On a Heuristic Method of Test Construction and Its Use in Multivariate Analysis," Annals of Mathematical Statistics, 24, 220-238.
Serfling, R. (1980), Approximation Theorems of Mathematical Statistics, New York: Wiley.
Stein, C. (1956), "The Admissibility of Hotelling's T[sup 2] Test," Annals of Mathematical Statistics, 27, 616-623.
Student (1908), "The Probable Error of a Mean," Biometrika, 6, 1-25.
Sun, J. (1991), "Significance Levels in Exploratory Projection Pursuit," Biometrika, 78, 759-769.
Wald, A. (1943), "Tests of Statistical Hypotheses Concerning Several Parameters When the Number of Observations is Large," Transactions of the American Mathematical Society, 54, 426-482.
----- (1950), Statistical Decision Functions, New York: Wiley.
Wermuth, N., and Cox, D. R. (1992), "Graphical Models for Dependencies and Associations," Proceedings of the Tenth Symposium on Computational Statistics, 1, 235-249.
Whittaker, J. (1990), Graphical Models in Applied Multivariate Statistics, New York: Wiley.
Wijsman, R. A. (1967), "Cross-Sections of Orbits and Their Application to Densities of Maximal Invariants," Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1, 389-400.
~~~~~~~~
By John I. Marden
John I. Marden is Professor, Department of Statistics, University of Illinois
at Urbana-Champaign, Champaign, IL 61820 (E-mail: marden@stat.uiuc.edu).
Title: | Bayesian Inference With Missing Data Using Bound and Collapse. |
Subject(s): | |
Source: | |
Author(s): | |
Abstract: | Current Bayesian methods to estimate conditional probabilities from samples with missing data pose serious problems of robustness and computational efficiency. This article introduces a new method, called bound and collapse (BC) to tackle these problems. BC first bounds the possible estimates consistent with the available information and then collapses these bounds to a point estimate using information about the pattern of missing data. Deterministic approximations of the variance and of the posterior distribution are proposed, and their accuracy is compared to stochastic approximations in a real dataset of polling data subject to nonresponse. |
AN: | 4018236 |
ISSN: | 1061-8600 |
Database: | Business Source Premier |
Key Words: Bayesian inference; Gibbs sampling; Imputation; Missing data.
Missing data challenge statisticians because they can affect the randomness of a sample (Copas and Li 1997), thus removing the grounds for the use of most statistical methods. The impact of missing data on the randomness of a sample depends on the process responsible for the disappearance of some of the sample entries. The received theory for missing data goes back to Rubin (1976), who identified three cases: missing completely at random (MCAR); missing at random (MAR); and informatively missing (IM). These three patterns are described by associating each sample variable with a dummy variable R[sub i], taking value 1 when the entry of the variable is missing and 0 otherwise. The probability distribution of each variable R[sub i] specifies the missing data model: data are MCAR if each R[sub i] is independent of all variables in the dataset; when the distribution of each R[sub i] is a function of the observed values in the dataset, data are MAR; and data are IM when the distribution of each R[sub i] is a function of observed and unobserved entries in the dataset. When data are either MAR or MCAR, the missing data mechanism is ignorable (Rubin 1976) because inference does not depend on it, while for IM data the mechanism is not ignorable. A general treatment of missing data is typically very complex, and several methods were described by Gelman, Carlin, Stern, and Rubin (1995), Little and Rubin (1987), and Schafer (1997). Relevant literature for incomplete multinomial data includes Cowell, Dawid, and Sebastiani (1996), Dickey, Jiang, and Kadane (1982), Jiang, Kadane, and Dickey (1992), and Spiegelhalter and Lauritzen (1990), while Paulino and deB. Pereira (1992, 1995) dealt with the specific problem of informative missing data.
This article focuses on the following situation. Let X[sub 1],...,X[sub k] and Y be categorical variables. We write X = (X[sub 1],...,X[sub k]), and denote by X = i (i = 1,...,r) and by Y = j (j = 1,...,c) the states of X and Y. We wish to estimate the conditional distributions of Y|X = i and the marginal distribution of Y from an incomplete sample S, in which the independent variable X is always observed and the response variable Y is subject to missing data. We augment the sample by adding the dummy variable R taking value 1 when the sample entry of Y is missing so that (X = i, R = 1) indicates an incomplete case. The three missing data mechanisms described above differ in the dependence of the probability of nonresponse p(R = 1|x, y) on X and Y. Data are MAR if p(R = 1|x, y) = p(R = 1|x). In this case, the observed values of Y are not representative of the complete sample as a whole, but they are so when considered within categories of X. When p(R = 1|x) = p(R = 1); that is, the probability of R = 1 is independent of X as well, data are MCAR. When the probability of R = 1 depends on Y and possibly on X, data are IM and the incomplete sample is no longer representative.
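The three mechanisms can be illustrated with a small simulation (a sketch, not part of the article; the conditional distributions and nonresponse rates below are invented for illustration). Nonresponse is generated from a constant rate (MCAR), from the observed x only (MAR), or from the unobserved y itself (IM); only in the last case do the complete cases within a category of X stop being representative.

```python
import random

random.seed(0)

def simulate(mechanism, n=20000):
    """Draw (x, y, r) triples; r = 1 means the y entry is missing.

    MCAR: P(R=1) is constant; MAR: P(R=1|x) depends only on the observed x;
    IM:   P(R=1|x, y) depends on the unobserved y itself."""
    data = []
    for _ in range(n):
        x = random.randint(0, 1)
        # P(Y=1|X=0) = .5 and P(Y=1|X=1) = .8 (invented for illustration)
        y = random.randint(0, 1) if x == 0 else int(random.random() < 0.8)
        if mechanism == "MCAR":
            p_miss = 0.3
        elif mechanism == "MAR":
            p_miss = 0.1 if x == 0 else 0.5
        else:  # IM: nonresponse driven by the value that goes unobserved
            p_miss = 0.1 if y == 0 else 0.5
        data.append((x, y, int(random.random() < p_miss)))
    return data

def observed_mean_y(data, x):
    """Mean of y among complete cases within the category X = x."""
    obs = [y for (xi, y, r) in data if xi == x and r == 0]
    return sum(obs) / len(obs)
```

Under MAR, observed_mean_y recovers P(Y = 1|X = 1) = .8 from the complete cases; under IM it is biased downward, since the incomplete sample is no longer representative even within categories of X.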
In all these cases, exact Bayesian analysis is computationally infeasible and one must resort to MCMC methods, such as Gibbs sampling (Geman and Geman 1984), data augmentation (Tanner and Wong 1987; Tanner 1996), exact Monte Carlo (Forster, McDonald, and Smith 1996; Smith, Forster, and McDonald 1996), or some other form of imputation (Gelman et al. 1995). These methods share two main drawbacks: (1) they rely on the assumption that information about the missing data mechanism is available, and this is not always the case; (2) they do not provide a measure of the sensitivity of the estimates to the assumption made on the missing data mechanism. Furthermore, the computational cost of Gibbs sampling and data augmentation is a function of the number of missing data, and they become infeasible in samples with a large number of nonresponses. In imputation-based methods, the precision of the estimates increases as the number of simulated complete samples does, and the computational cost of each simulation is an increasing function of the number of variables about which data are missing.
This article introduces a new method, called bound and collapse (BC) to achieve three main objectives: (1) the definition of an estimation method from incomplete multinomial data with Dirichlet priors that is robust with respect to the pattern of missing data; (2) the identification of reliability measures able to account for the presence of incomplete cases; and (3) the development of efficient computational methods to perform these calculations. The intuition behind BC is that, with no information about the missing data mechanism, an incomplete sample is still able to bound the set of possible estimates within an interval defined by extreme distributions. Although the traditional Bayesian approach would model ignorance via noninformative priors (Bernardo and Smith 1994), we model the lack of information about the missing data mechanism by simply allowing the probabilities of nonresponse to range between 0 and 1. The inference is then a set of probability intervals representing the uncertainty due to nonresponse. The approach to modeling uncertainty via probability intervals follows the traditional work of Good (1962), Kyburg (1961), Dempster (1968) or the more recent work of Walley (1991) and, as discussed in Bernardo and Smith (1994), is the basis for a robust Bayesian approach to inference. If information about the missing data mechanism is available, it can be encoded in a proper probabilistic model of nonresponse and used to select a single estimate within these bounds. Thus, the second step of BC collapses each interval computed in the bound step to a single value via a convex combination of the extreme estimates with weights depending on the assumed pattern of missing data. Ignorable missing data mechanisms, such as MAR or MCAR, can be easily represented by modeling nonresponse from complete cases in the sample and, in these cases, BC returns a generalized version of the maximum likelihood estimates.
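The two steps admit a compact sketch for a single conditional distribution p(Y|X = i) with a Dirichlet prior (a hedged illustration, not the authors' code; the counts, prior, and nonresponse weights are hypothetical). The bound step computes the extreme posterior means obtained by assigning all or none of the nonresponses to each category; the collapse step takes a convex combination of the extremes with weights phi[sub j] encoding the assumed pattern of missing data.

```python
def bound_and_collapse(counts, n_missing, prior, phi):
    """Bound-and-collapse sketch for one conditional distribution p(Y | X = i).

    counts[j]  -- observed cases with Y = j in the slice X = i
    n_missing  -- incomplete cases (Y unobserved) with X = i
    prior[j]   -- Dirichlet hyperparameters
    phi[j]     -- assumed probability that a nonresponse hides Y = j;
                  phi encodes the missing-data model and sums to 1
    """
    total = sum(prior) + sum(counts) + n_missing
    bounds, estimate = [], []
    for j in range(len(counts)):
        lo = (prior[j] + counts[j]) / total              # no missing case is Y = j
        hi = (prior[j] + counts[j] + n_missing) / total  # every missing case is Y = j
        bounds.append((lo, hi))
        # collapse: convex combination of the extremes, weighted by phi[j]
        estimate.append((1 - phi[j]) * lo + phi[j] * hi)
    return bounds, estimate

# hypothetical slice: 30 cases with Y = 0, 10 with Y = 1, and 20 nonresponses
bounds, est = bound_and_collapse([30, 10], 20, [1, 1], [0.75, 0.25])
```

The width of each interval in `bounds` measures the uncertainty due to nonresponse, separately from sampling variability; under an ignorable mechanism, phi would simply be estimated from the complete cases, and the cost is independent of the number of missing entries, as the article notes.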
BC returns estimates of the probabilities of a multinomial model with a Dirichlet prior. To provide measures of variability of the estimates, or inference about other quantities of interest, we need the posterior distribution. Since the true posterior distribution is a mixture of Dirichlet distributions, we propose to approximate it with a single Dirichlet distribution that matches the means and an approximation of the variances, both of which can be computed from the BC estimates. In the particular case that data are MAR, the approximation returns the exact posterior distribution of the conditional probabilities of Y given X.
A feature of BC is the ability to represent explicitly and separately the information conveyed by the sample and the assumptions about the pattern of missing data. Bounds on the possible estimates represent uncertainty due to missing data and the width of intervals computed in the bound step can be regarded as a measure of the quality of the information on the estimates conveyed by the incomplete sample. By computing bounds, the uncertainty due to missing data is therefore retained in the analysis, so that sampling variability and uncertainty due to nonresponse are independently computed and separately represented. A further advantage of BC is its computational cost: for each conditional distribution, BC reduces the computational complexity of the analysis to one exact updating for each state of the response variable in the bound step, and a convex combination in the collapse step. The computational complexity of BC is a function only of the number of states of Y, and therefore its cost is independent of the number of missing data. Hence, BC allows different models for nonresponse to be efficiently evaluated, so that sensitivity of the conclusions to the assumed pattern of missing data can be quickly examined and marginal inference can be obtained easily by averaging out results computed for different models of nonresponse. Thus, BC provides a general framework for the sensitivity analysis approach to incomplete samples advocated, for instance, by Kadane (1993) and Kadane and Terrin (1997).
The remainder of this article is structured as follows. Section 2 reviews some theoretical and computational issues relevant to the development and presentation of BC. Section 3 presents BC and describes inference methods based on it. Section 4 derives a moment-matching approximation to the posterior distribution based on the BC estimates. Section 5 illustrates the features of BC using polling data from the 1992 British General Election. Section 6 summarizes the article.
As long as the sample is complete, Bayesian conjugate analysis provides a simple way to estimate the conditional distributions of Y given X, as well as the marginal distribution of Y. We assume that (X, Y) has a joint multinomial distribution with probabilities $\theta_{ij} = p(X = i, Y = j \mid \theta)$, where $\theta = (\theta_{11}, \theta_{12}, \ldots, \theta_{rc}) = (\theta_{ij})$ ($\theta_{ij} > 0$ for all i and j, and $\sum_{ij} \theta_{ij} = 1$) parameterizes the distribution. The standard conjugate prior adopted for $\theta$ is a Dirichlet distribution $D(\alpha)$, with $\alpha = (\alpha_{11}, \ldots, \alpha_{rc})$, $\alpha_{ij} \ge 0$ for all i and j, where $\alpha = \sum_{ij} \alpha_{ij}$ is the prior precision. Hence, the prior density is $p(\theta) \propto \prod_{ij} \theta_{ij}^{\alpha_{ij} - 1}$. An advantage of using a Dirichlet prior is that the distributions of the parameters associated with the marginal distributions of X and Y, and with the r conditional distributions of $Y \mid X = i$, are still Dirichlet (Fang, Kotz, and Ng 1990, p. 19). Write $\theta_{i+} = \sum_j \theta_{ij}$, $\theta_{+j} = \sum_i \theta_{ij}$, $\theta_{j|i} = \theta_{ij}/\theta_{i+}$, and similarly let $\alpha_{i+} = \sum_j \alpha_{ij}$ and $\alpha_{+j} = \sum_i \alpha_{ij}$. Then $\theta_I = (\theta_{1+}, \ldots, \theta_{r+}) \sim D(\alpha_I)$, where $\alpha_I = (\alpha_{1+}, \ldots, \alpha_{r+})$; $\theta_J = (\theta_{+1}, \ldots, \theta_{+c}) \sim D(\alpha_J)$, with $\alpha_J = (\alpha_{+1}, \ldots, \alpha_{+c})$; and $\theta_{J|i} = (\theta_{1|i}, \ldots, \theta_{c|i}) \sim D(\alpha_{J|i})$, where $\alpha_{J|i} = (\alpha_{i1}, \ldots, \alpha_{ic})$. Furthermore, $\theta_I$ and $\theta_{J|i}$ are marginally independent.
Suppose that we wish to infer $\theta$ from a random sample $S = \{s_1, \ldots, s_n\}$ of cases that are independent given $\theta$. Data can be classified in an $r \times c$ contingency table, where $n_{ij}$ denotes the frequency of $(X = i, Y = j)$ in S. Let n be the vector $(n_{11}, \ldots, n_{rc})$. Hence, the likelihood function is $p(S \mid \theta) \propto \prod_{ij} \theta_{ij}^{n_{ij}}$ and, by conjugacy, the posterior distribution of $\theta$ remains Dirichlet: $\theta \mid S \sim D(\alpha_{11} + n_{11}, \ldots, \alpha_{rc} + n_{rc}) \equiv D(\alpha + n)$, with density function $p(\theta \mid S) \propto \prod_{ij} \theta_{ij}^{\alpha_{ij} + n_{ij} - 1}$. Thus, the posterior precision is $\alpha + n$, where $n = \sum_{ij} n_{ij}$, and the posterior expectation of $\theta_{ij}$ provides a point estimate of the joint probability of $(X = i, Y = j)$ under quadratic loss:

$$\hat{\theta}_{ij} = E(\theta_{ij} \mid S) = \frac{\alpha_{ij} + n_{ij}}{\alpha + n}.$$
The fact that the joint posterior distribution of $\theta$ is still Dirichlet leads to a simple updating rule for the distributions of $\theta_I$, $\theta_J$, and $\theta_{J|i}$. Denote row and column totals by $n_{i+} = \sum_j n_{ij}$ and $n_{+j} = \sum_i n_{ij}$, and let $n_I = (n_{i+})$, $n_J = (n_{+j})$, and $n_{J|i} = (n_{i1}, \ldots, n_{ic})$. Then $\theta_I \mid S \sim D(\alpha_I + n_I)$, $\theta_J \mid S \sim D(\alpha_J + n_J)$, and $\theta_{J|i} \mid S \sim D(\alpha_{J|i} + n_{J|i})$ for all i. Furthermore, $\theta_I \mid S$ and $\theta_{J|i} \mid S$, for all i, are still marginally independent. Therefore, the Bayesian estimates of $p(X = i)$ and $p(Y = j \mid X = i)$ are

$$\hat{\theta}_{i+} = \frac{\alpha_{i+} + n_{i+}}{\alpha + n} \quad \text{and} \quad \hat{\theta}_{j|i} = \frac{\alpha_{ij} + n_{ij}}{\alpha_{i+} + n_{i+}}.$$
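These closed-form updates are simple enough to sketch in code. The following is an illustrative NumPy sketch (the array layout and function name are ours, not the article's), computing the posterior means for a complete $r \times c$ table:

```python
import numpy as np

def dirichlet_posterior_means(alpha, n):
    """Posterior means for a complete r x c table under a Dirichlet prior.

    alpha, n: r x c arrays of prior hyperparameters and observed counts.
    Returns (joint, row, cond), where
      joint[i, j] = (alpha_ij + n_ij) / (alpha + n)        # E(theta_ij | S)
      row[i]      = (alpha_i+ + n_i+) / (alpha + n)        # E(theta_i+ | S)
      cond[i, j]  = (alpha_ij + n_ij) / (alpha_i+ + n_i+)  # E(theta_j|i | S)
    """
    a = alpha + n                              # posterior hyperparameters, D(alpha + n)
    joint = a / a.sum()                        # normalize the whole table
    row = a.sum(axis=1) / a.sum()              # normalize the row totals
    cond = a / a.sum(axis=1, keepdims=True)    # normalize within each row
    return joint, row, cond
```

Each returned array is just the posterior hyperparameter table normalized in a different way, which is all the conjugate analysis requires.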
This simple analysis is precluded when the observed sample contains incomplete cases, because the amount of information available about Y and X is unbalanced. Suppose that some of the entries on the variable Y are reported as unknown. Let $S = (S_o, S_m)$, where $S_o$ and $S_m$ denote the subsamples with complete observations and with unknown entries on Y, respectively, and let $m_i$ be the frequency of incomplete cases $(X = i, R = 1)$. Thus, $n = \sum_{ij} n_{ij}$ is the number of cases completely observed, $m = \sum_i m_i$ is the number of cases partially observed, and $n + m$ is now the sample size. Following Little and Rubin (1987), we represent the incomplete sample in the $r \times (c + 1)$ contingency Table 1, in which the $(c+1)$th column contains the frequency of unknown cases for each category of X.
Let $S_{d_i}$ be a possible distribution of the unclassified cases in $S_m$. The exact posterior distribution of $\theta$ is a mixture of Dirichlet distributions weighted by the probabilities of the possible completions of S: $p(\theta \mid S) \propto \sum_{d_i} p(\theta \mid S_{d_i}, S_o)\, p(S_{d_i} \mid S_o)$. The weights $p(S_{d_i} \mid S_o)$ can be computed if information about the missing data mechanism is available. Suppose that this information leads to the formulation of the probability of nonresponse

$$\phi_{j|i} = p(Y = j \mid X = i, R = 1, \phi, \theta),$$

where R = 1 continues to denote a nonresponse for Y. A complete Bayesian approach regards $\phi$ as a random vector, with prior density $p(\phi)$, so that the mixture weights of the exact posterior distribution are $p(S_{d_i} \mid S_o) = \int p(S_{d_i} \mid S_o, \phi, \theta)\, p(\phi, \theta \mid S_o)\, d\phi\, d\theta$. The assumed missing data mechanism shapes the probability $p(S_{d_i} \mid S_o)$ via $\phi$. If data are MAR, $p(R = 1 \mid X = i, Y = j, \phi, \theta) = p(R = 1 \mid X = i, \phi, \theta)$ and $\phi_{j|i} = \theta_{j|i}$, so that $p(S_{d_i} \mid S_o) = \int p(S_{d_i} \mid \theta)\, p(\theta \mid S_o)\, d\theta$. When data are informatively missing (IM), the probability of a missing observation on Y is generally a function of both X and Y; $p(S_{d_i} \mid S_o)$ can be computed once a prior distribution on $\phi$ is specified, and the posterior distribution is a mixture over all possible completions of the sample. Some simplifications are nonetheless possible, and some of the parameter independences are retained. The first simplification concerns the posterior distribution of $\theta_I$ and $\theta_{J|I} = (\theta_{J|1}, \ldots, \theta_{J|r})$, and is given in the next theorem. The result is known; see, for instance, Forster and Smith (1998).
Theorem 1. Let S be an incomplete sample in which $n_{ij}$ is the frequency of observed cases $(X = i, Y = j)$ and $m_i$ is the frequency of cases $(X = i, R = 1)$. If $\theta \sim D(\alpha)$, the posterior distribution of $\theta_I$ is $D(\alpha_{1+} + n_{1+} + m_1, \ldots, \alpha_{r+} + n_{r+} + m_r) \equiv D(\alpha_I + n_I + m)$, where $m = (m_1, \ldots, m_r)$, and $\theta_I$ and $\theta_{J|I}$ are independent.
Thus, the Bayesian estimate of $\theta_{i+}$ is

(2.1) $$\hat{\theta}_{i+} = \frac{\alpha_{i+} + n_{i+} + m_i}{\alpha + n + m},$$
and the marginal independence of $\theta_I$ and $\theta_{J|I}$ is retained in the posterior distribution. When data are MAR, the joint distribution of $\theta_{J|I}$ simplifies, and incomplete cases need not be taken into account in the inference about the conditional probabilities of $Y \mid X$. The next result is mentioned by Spiegelhalter and Lauritzen (1990).
Theorem 2. Suppose that the missing data mechanism is MAR. Then the distribution of $\theta_{J|I}$ factorizes into a product of independent Dirichlet distributions $D(\alpha_{J|i} + n_{J|i})$.
Thus, if data are MAR, the Bayesian estimate of $\theta_{j|i}$ is

(2.2) $$\hat{\theta}_{j|i} = \frac{\alpha_{ij} + n_{ij}}{\alpha_{i+} + n_{i+}}.$$
The results in Theorems 1 and 2 exclude the possibility that the joint posterior distribution of $\theta$ is Dirichlet when data are MAR. The marginal posterior distribution of $\theta_I$ is $D(\alpha_I + n_I + m)$. If the joint posterior distribution were Dirichlet, then $\alpha_{i+} + n_{i+} + m_i$ would be the posterior precision of $\theta_{J|i}$, but this is excluded by Theorem 2, which ensures that the posterior precision of $\theta_{J|i}$ is $\alpha_{i+} + n_{i+}$. Under the MAR assumption, however, it is possible to compute the posterior means of $\theta_{ij}$ and $\theta_{+j}$ in closed form. We use $\theta_{ij} = \theta_{i+} \theta_{j|i}$ and $\theta_{+j} = \sum_{i=1}^r \theta_{i+} \theta_{j|i}$. By Theorems 1 and 2, and the results of Wilks (1963, chap. 7.7), we know that $\theta_{i+} \perp \theta_{j|i}$, $\theta_{i+} \sim D(\alpha_{i+} + n_{i+} + m_i,\ \alpha + n + m - \alpha_{i+} - n_{i+} - m_i)$, and $\theta_{j|i} \sim D(\alpha_{ij} + n_{ij},\ \alpha_{i+} + n_{i+} - \alpha_{ij} - n_{ij})$. Thus, it is easy to show that

(2.3) $$\hat{\theta}_{ij} = \frac{\alpha_{i+} + n_{i+} + m_i}{\alpha + n + m} \cdot \frac{\alpha_{ij} + n_{ij}}{\alpha_{i+} + n_{i+}} \quad \text{and} \quad \hat{\theta}_{+j} = \sum_{i=1}^r \hat{\theta}_{ij},$$
so that incomplete cases are distributed across the categories of $Y \mid X = i$ according to the distribution of $\theta_{J|i}$. Note that (2.3) generalizes the MLEs given by Little and Rubin (1987) by adding the flattening constant $\alpha_{ij}$. However, there is no simple expression for the posterior distributions of $\theta_{ij}$ and $\theta_{+j}$, so inference can be based on approximations using, for instance, MCMC methods (Tanner 1996) or moment-matching approximations (Titterington, Smith, and Makov 1985). We discuss moment-matching approximations further in Section 4.
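Under MAR, the estimates (2.1)-(2.3) combine mechanically. A minimal NumPy sketch (toy array layout and names of our own, not the article's code), taking $r \times c$ arrays of prior hyperparameters and complete-case counts, plus a length-r vector of nonresponse counts:

```python
import numpy as np

def mar_estimates(alpha, n, m):
    """Exact Bayesian estimates when Y is MAR.

    alpha, n: r x c prior hyperparameters and complete-case counts;
    m: length-r counts of cases observed only on X.
    """
    A, N, M = alpha.sum(), n.sum(), m.sum()
    # (2.1): incomplete cases DO count for the margin of X
    theta_row = (alpha.sum(axis=1) + n.sum(axis=1) + m) / (A + N + M)
    # (2.2): incomplete cases do NOT count for Y given X
    theta_cond = (alpha + n) / (alpha + n).sum(axis=1, keepdims=True)
    # (2.3): product of the two, then sum over rows for the margin of Y
    theta_joint = theta_row[:, None] * theta_cond
    theta_col = theta_joint.sum(axis=0)
    return theta_row, theta_cond, theta_joint, theta_col
```

The joint estimates sum to one by construction, since the conditional rows each sum to one and the row margin does as well.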
The theoretical framework described so far exploits the flexibility of the Bayesian approach to incorporate assumptions on the pattern of missing data. However, exact inference is possible only in a few particular cases, and often we must resort either to simplifying assumptions or to expensive approximate methods. The approach that we propose in the next section limits the amount of information about the missing data mechanism to gain computational efficiency.
Section 2 describes how information on the missing data mechanism can be used to compute, in principle, the exact posterior distribution of $\theta$. However, this information may not be available, or may be extremely uncertain. Instead of modeling ignorance with noninformative priors, we assume only that the probabilities of nonresponse are constrained to lie in the interval [0, 1], and we show that the incomplete sample is still able to induce bounds on the possible estimates consistent with the available information, thus producing an interval-based inference. When information about the missing data mechanism is available, it can be used to select a single estimate within the set of possible ones. This is the intuition behind BC: the method first bounds the possible estimates consistent with the available data, and then collapses the resulting intervals to point estimates via a convex combination of the extreme points, on the basis of the information about the missing data mechanism.
3.1 Bound
By Theorem 1, the posterior estimates (2.1) of $\theta_I$ can be computed independently of the missing data mechanism. Uncertainty on the estimate of $\theta_{ij}$ depends on the nonresponse, and hence on the estimate of $\theta_{j|i}$. If no missing data mechanism is specified then, for fixed i, any value

(3.1) $$\hat{\theta}_{j|i} = \frac{\alpha_{ij} + n_{ij} + m_{ij}}{\alpha_{i+} + n_{i+} + m_i}, \qquad 0 \le m_{ij} \le m_i, \quad \sum_j m_{ij} = m_i,$$

is a possible estimate consistent with the information available in the sample, where $m_{ij}$ denotes the number of unclassified cases with X = i assigned to Y = j. For fixed i and j, (3.1) is maximized when $m_{ij} = m_i$, so that the set of possible estimates of $\theta_{j|i}$ is bounded above by

(3.2) $$\hat{\theta}_{j|i,\max} = \frac{\alpha_{ij} + n_{ij} + m_i}{\alpha_{i+} + n_{i+} + m_i}.$$

The maximum probability $\hat{\theta}_{j|i,\max}$ is obtained when all unclassified cases with X = i are assigned to Y = j, and the corresponding exact posterior distribution of $\theta_{J|i}$ is $D(\alpha_{i1} + n_{i1}, \ldots, \alpha_{ij} + n_{ij} + m_i, \ldots, \alpha_{ic} + n_{ic})$. This distribution identifies a unique minimum probability of $(Y = l \mid X = i)$ for all $l \ne j$:

$$\hat{\theta}_{l|i,\min} = \frac{\alpha_{il} + n_{il}}{\alpha_{i+} + n_{i+} + m_i},$$

which is independent of j. Thus, the possible estimates of $\theta_{j|i}$ are bounded as

(3.3) $$\hat{\theta}_{j|i,\min} = \frac{\alpha_{ij} + n_{ij}}{\alpha_{i+} + n_{i+} + m_i} \;\le\; \hat{\theta}_{j|i} \;\le\; \frac{\alpha_{ij} + n_{ij} + m_i}{\alpha_{i+} + n_{i+} + m_i} = \hat{\theta}_{j|i,\max}.$$
From (3.3), we can also derive bounds on the possible estimates of $\theta_{ij}$ and $\theta_{+j}$. By independence of $\theta_{i+}$ and $\theta_{j|i}$, it is easy to show that $\hat{\theta}_{i+} \hat{\theta}_{j|i,\min} \le \hat{\theta}_{ij} \le \hat{\theta}_{i+} \hat{\theta}_{j|i,\max}$. Furthermore, $E(\theta_{+j} \mid S) = \sum_{i=1}^r \hat{\theta}_{i+} E(\theta_{j|i} \mid S)$, and this function is maximized when $E(\theta_{j|i} \mid S) = \hat{\theta}_{j|i,\max}$ for all i and minimized when $E(\theta_{j|i} \mid S) = \hat{\theta}_{j|i,\min}$ for all i. Therefore, bounds on $\theta_{+j}$ are

(3.4) $$\sum_{i=1}^r \hat{\theta}_{i+} \hat{\theta}_{j|i,\min} \;\le\; \hat{\theta}_{+j} \;\le\; \sum_{i=1}^r \hat{\theta}_{i+} \hat{\theta}_{j|i,\max}$$

for all j. Note that, since

$$\hat{\theta}_{j|i,\max} - \hat{\theta}_{j|i,\min} = \frac{m_i}{\alpha_{i+} + n_{i+} + m_i},$$

the width of the probability interval returned by the bound step is constant for all j, and it is increasing in the number of unclassified cases. Hence, the width $\hat{\theta}_{j|i,\max} - \hat{\theta}_{j|i,\min}$ can be regarded as a direct measure of the information about $\theta_{j|i}$ conveyed by the incomplete sample, and as a representation of the variability of the estimates as a function of the unreported cases for each category of X.
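The bound step amounts to a handful of array operations. An illustrative sketch of (3.3) and (3.4) (the function name and array layout are ours, not the article's):

```python
import numpy as np

def bound(alpha, n, m):
    """Bound step: interval estimates (3.3) for theta_{j|i}, (3.4) for theta_{+j}.

    alpha, n: r x c prior hyperparameters and complete-case counts;
    m: length-r counts of nonresponses per category of X.
    """
    denom = (alpha + n).sum(axis=1) + m              # alpha_i+ + n_i+ + m_i
    lo = (alpha + n) / denom[:, None]                # all m_i cases on OTHER categories
    hi = (alpha + n + m[:, None]) / denom[:, None]   # all m_i cases on category j
    width = m / denom                                # constant over j for each i
    # Row-margin estimates (2.1), then (3.4) for the margin of Y
    theta_row = (alpha.sum(axis=1) + n.sum(axis=1) + m) / (alpha.sum() + n.sum() + m.sum())
    col_lo = theta_row @ lo
    col_hi = theta_row @ hi
    return lo, hi, width, col_lo, col_hi
```

The interval widths come out identical across the columns of each row, mirroring the constant-width property noted above.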
3.2 Collapse
For each category of X, incomplete cases induce a set of extreme estimates corresponding to the most extreme situations, in which data are systematically missing on one category of Y. Any assumption about the pattern of missing data will induce an estimate of $\theta_{J|i} \mid S$ within these bounds; information about the missing data mechanism can therefore be used to identify a single estimate. This is the key idea of the collapse step. We assume that some external information on the missing data mechanism is available, from which we can deduce a probabilistic model for nonresponse as
(3.5) $$p(Y = j \mid R = 1, X = i, \phi, \theta) = \phi_{j|i},$$

where $\sum_j \phi_{j|i} = 1$ for all i, and $\phi = \{\phi_{j|i}\}$. The set $\phi$ can be used to identify a point estimate within the probability interval $(\hat{\theta}_{j|i,\min}, \hat{\theta}_{j|i,\max})$ via a convex combination of the extreme probabilities:

(3.6) $$\hat{\theta}_{j|i} \mid \phi = \phi_{j|i}\, \hat{\theta}_{j|i,\max} + (1 - \phi_{j|i})\, \hat{\theta}_{j|i,\min} = \frac{\alpha_{ij} + n_{ij} + \phi_{j|i} m_i}{\alpha_{i+} + n_{i+} + m_i}.$$
For fixed i, as j ranges over $1, \ldots, c$, (3.6) defines a probability distribution, since $\sum_j (\hat{\theta}_{j|i} \mid \phi) = 1$ for all i. The estimates (3.6) distribute the unclassified cases within each category of X across the categories of Y according to the prior information about the pattern of missing data. Note that (3.6) can be rewritten as
$$\hat{\theta}_{j|i} \mid \phi = \frac{\alpha_{i+} + n_{i+}}{\alpha_{i+} + n_{i+} + m_i} \cdot \frac{\alpha_{ij} + n_{ij}}{\alpha_{i+} + n_{i+}} + \frac{m_i}{\alpha_{i+} + n_{i+} + m_i}\, \phi_{j|i},$$

which is a weighted average of the estimate of $\theta_{j|i}$ obtained from the complete subsample $S_o$ and the prior information $\phi_{j|i}$, with weights that depend on $m_i$. As $m_i$ decreases, the sample estimate has more weight than $\phi_{j|i}$ and, when $m_i = 0$, $\hat{\theta}_{j|i} \mid \phi = (\alpha_{ij} + n_{ij})/(\alpha_{i+} + n_{i+})$. Thus, when the sample is complete, (3.6) is the exact estimate $E(\theta_{j|i} \mid S)$. As $m_i$ increases, $\hat{\theta}_{j|i} \mid \phi \to \phi_{j|i}$ so that, coherently, nothing is learned from an empty sample.
Once $\hat{\theta}_{j|i} \mid \phi$ and $\hat{\theta}_{i+}$ are known, by independence of $\theta_I$ and $\theta_{J|I}$ the joint probability of $(X = i, Y = j \mid S)$ can be written as $E(\theta_{i+} \mid S)\, E(\theta_{j|i} \mid S)$, leading to

(3.7) $$\hat{\theta}_{ij} \mid \phi = \frac{\alpha_{ij} + n_{ij} + \phi_{j|i} m_i}{\alpha + n + m}.$$

The marginal posterior probability of $(Y = j)$ is then estimated as

(3.8) $$\hat{\theta}_{+j} \mid \phi = \frac{\alpha_{+j} + n_{+j} + \sum_i \phi_{j|i} m_i}{\alpha + n + m}.$$
If data are MAR, we showed in Section 2 that $\phi_{j|i} = \theta_{j|i}$. Thus, we can use the observed cases in the sample to estimate $\phi$ as

$$\hat{\phi}_{j|i} = \frac{\alpha_{ij} + n_{ij}}{\alpha_{i+} + n_{i+}},$$

and simple algebra shows that $\hat{\theta}_{j|i} \mid \phi = \hat{\phi}_{j|i}$. This is the estimate (2.2) of Section 2.
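The collapse step (3.6)-(3.8) is a single convex combination per cell. A sketch under the same toy array layout as the estimates above (names ours, not the article's code):

```python
import numpy as np

def collapse(alpha, n, m, phi):
    """Collapse step: phi[i, j] is the assumed p(Y=j | X=i, R=1).

    Returns conditional (3.6), joint (3.7), and marginal (3.8) estimates.
    """
    denom = (alpha + n).sum(axis=1) + m                            # alpha_i+ + n_i+ + m_i
    theta_cond = (alpha + n + phi * m[:, None]) / denom[:, None]   # (3.6)
    total = alpha.sum() + n.sum() + m.sum()                        # alpha + n + m
    theta_joint = (alpha + n + phi * m[:, None]) / total           # (3.7)
    theta_col = theta_joint.sum(axis=0)                            # (3.8)
    return theta_cond, theta_joint, theta_col
```

As a sanity check, passing the observed conditional distribution $(\alpha_{ij} + n_{ij})/(\alpha_{i+} + n_{i+})$ as phi reproduces the MAR estimate (2.2), in line with the algebra above.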
From a computational point of view, BC provides a deterministic method that reduces the cost of estimating the conditional and marginal distributions of Y to the cost of one exact Bayesian updating and one convex combination for each category of Y within each category of X. The computational complexity of BC is therefore independent of the number of missing data and, being deterministic, BC does not pose the convergence-detection problems that afflict iterative and stochastic methods.
The computational simplicity of BC provides an alternative way to incorporate uncertainty about $\phi$ that overcomes the computational limitations of the full Bayesian approach described in Section 2, in which $\phi$ is treated as a random vector with prior density $p(\phi)$. The collapse step can be regarded as a first-order approximation of a full Bayesian analysis, obtained by letting the $\phi_{j|i}$ in (3.5) be the expectations $E(\phi_{j|i})$. A specified model $p(\phi)$ for nonresponse then yields point estimates of $\theta_{j|i}$ that are the expected Bayesian estimates of $\theta_{j|i}$ given $\phi$. By assuming different prior distributions for $\phi$, the sensitivity of the conclusions to the assumed pattern of missing data can be easily and quickly examined, and marginal inference can be obtained by averaging out the estimates for different $\phi$.
BC computes only estimates of the conditional, joint, and marginal probabilities of Y for a given $\phi$, and complete inference requires the posterior distribution of these parameters. Once a model for the missing data has been chosen, direct Monte Carlo methods provide a way to make calculations on parameters of interest. This section describes a deterministic moment-matching approximation of the posterior distribution of $\theta_{j|i}$, from which approximations of the distributions of other parameters can be derived. Although based on a very simple idea, empirical studies reported, for example, by Cowell, Dawid, and Sebastiani (1996), Spiegelhalter and Lauritzen (1990), and Spiegelhalter and Cowell (1992) support the hypothesis that moment-matching approximations of Dirichlet mixtures can be fairly accurate, and the example discussed in Section 5 provides further evidence. The simple, deterministic nature of the approximation presented here allows one to perform, quickly and accurately, sensitivity analysis on parameters other than the posterior means.
Let $\hat{\theta}_{J|i} \mid \phi$ be the vector of BC estimates $\hat{\theta}_{j|i} \mid \phi$, and $\hat{\theta} \mid \phi$ be the vector of BC estimates $\hat{\theta}_{ij} \mid \phi$. For simplicity of notation, we omit the conditioning on $\phi$, although one should bear in mind that the BC estimates are conditional on an explicit model for the nonresponse. The basic idea of moment matching is to approximate a mixture of Dirichlet distributions by a single Dirichlet distribution matching some of its moments. If we set $\theta_{J|i} \mid S, \phi \sim D(\hat{\alpha}_{i+} \hat{\theta}_{J|i})$ for some precision $\hat{\alpha}_{i+}$, this distribution returns the BC estimates of $\theta_{j|i}$ as $E(\theta_{j|i} \mid S) = \hat{\theta}_{j|i}$, independently of the posterior precision $\hat{\alpha}_{i+}$. The posterior variances are a function of $\hat{\alpha}_{i+}$:

(4.1) $$V(\theta_{j|i} \mid S) = \frac{\hat{\theta}_{j|i}(1 - \hat{\theta}_{j|i})}{\hat{\alpha}_{i+} + 1}.$$

Now, we need to choose $\hat{\alpha}_{i+}$. Note that $V(\theta_{j|i} \mid S)$ is a decreasing function of $\hat{\alpha}_{i+}$ for fixed $\hat{\theta}_{j|i}$. The maximum variance is attained for $\hat{\alpha}_{i+} = \alpha_{i+} + n_{i+}$, and this is the exact precision when data are MAR. The minimum variance is attained for $\hat{\alpha}_{i+} = \alpha_{i+} + n_{i+} + m_i$; this would be the exact precision if data were known to be systematically missing on one category of Y (given X). In general, we can let $\hat{\alpha}_{i+} = \alpha_{i+} + n_{i+} + k m_i$, where $0 \le k \le 1$ is a weight assigned to an incomplete case that measures the nonignorability of the missing data. The value of k can also be regarded as the confidence in the prior specification of $\phi$. If k = 1, we weight an incomplete case as if it were a complete one; this would imply a strong prior confidence in the specified $\phi$. A value k < 1 yields a smaller posterior precision and hence a larger uncertainty on the estimates. The approximate posterior distribution allows us to derive an approximate marginal distribution of $\theta_{j|i}$ as $D(\hat{\alpha}_{i+} \hat{\theta}_{j|i},\ \hat{\alpha}_{i+}(1 - \hat{\theta}_{j|i}))$. If we now let $\hat{\theta}_{j|i}$ vary as a function of $\phi_{j|i}$, we have the following theorem.
Theorem 3. For fixed i and j, let $\phi^{(1)}_{j|i} > \phi^{(2)}_{j|i}$, and define $\hat{\theta}^{(h)}_{j|i} = \phi^{(h)}_{j|i} \hat{\theta}_{j|i,\max} + (1 - \phi^{(h)}_{j|i}) \hat{\theta}_{j|i,\min}$ and $\theta^{(h)}_{j|i} \sim D(\hat{\alpha}_{i+} \hat{\theta}^{(h)}_{j|i},\ \hat{\alpha}_{i+}(1 - \hat{\theta}^{(h)}_{j|i}))$. Then $\theta^{(1)}_{j|i}$ is stochastically larger than $\theta^{(2)}_{j|i}$; that is, $p(\theta^{(1)}_{j|i} > v) \ge p(\theta^{(2)}_{j|i} > v)$ for all $v \in [0, 1]$.
Proof: Let $f_h(\theta_{j|i})$ be the density function of $\theta^{(h)}_{j|i}$. It is easy to show that the likelihood ratio

$$\frac{f_1(\theta_{j|i})}{f_2(\theta_{j|i})} \propto \left( \frac{\theta_{j|i}}{1 - \theta_{j|i}} \right)^{\hat{\alpha}_{i+} (\hat{\theta}^{(1)}_{j|i} - \hat{\theta}^{(2)}_{j|i})}$$

is an increasing function of $\theta_{j|i}$, from which we have that $\theta^{(1)}_{j|i}$ is larger than $\theta^{(2)}_{j|i}$ in the sense of likelihood ratio. This ordering implies the usual stochastic order (Ross 1996, p. 433). [Note: This proof is due to Alessandra Giovagnoli.] Thus, the fact that $\hat{\theta}_{j|i}$ is an increasing function of $\phi_{j|i}$ induces a set of ordered posterior distributions for $\theta_{j|i}$ that gives an approximate sensitivity analysis with respect to $\phi$.
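The variance approximation (4.1), with the precision choice $\hat{\alpha}_{i+} = \alpha_{i+} + n_{i+} + k m_i$, is easy to sketch (illustrative code with our own names; k is the nonignorability weight described above):

```python
import numpy as np

def cond_variances(alpha, n, m, theta_cond, k=1.0):
    """Approximate posterior variances (4.1) of theta_{j|i}.

    theta_cond: r x c array of BC point estimates of theta_{j|i};
    k in [0, 1] weights each incomplete case in the precision.
    """
    prec = (alpha + n).sum(axis=1) + k * m                 # hat alpha_i+
    return theta_cond * (1.0 - theta_cond) / (prec[:, None] + 1.0)
```

Raising k from 0 to 1 can only shrink the variances, reflecting growing confidence in the assumed nonresponse model.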
An equivalent direct approximation of the posterior distribution of $\theta$ as a Dirichlet does not seem to be possible, because of the lack of symmetry in the amount of information about $\theta_I$, whose posterior precision is always $\alpha + n + m$, and $\theta_{J|i}$, whose posterior precision can be as small as $\alpha_{i+} + n_{i+}$ when data are MAR. However, from the approximate posterior variance of $\theta_{j|i}$, we can still derive approximations of the posterior variances of $\theta_{ij}$ and $\theta_{+j}$ that can be used as the kernel of moment-matching approximations of their marginal posterior distributions. Given that, by Theorem 1, $\theta_I \mid S \sim D(\alpha_I + n_I + m)$ and is independent of $\theta_{J|I} \mid S$, for the product $\theta_{ij} = \theta_{i+} \theta_{j|i}$ of independent factors we use the approximate variance

(4.2) $$V(\theta_{ij} \mid S) = \hat{\theta}_{i+}^2 V(\theta_{j|i} \mid S) + \hat{\theta}_{j|i}^2 V(\theta_{i+} \mid S) + V(\theta_{i+} \mid S)\, V(\theta_{j|i} \mid S),$$

where

(4.3) $$V(\theta_{i+} \mid S) = \frac{\hat{\theta}_{i+}(1 - \hat{\theta}_{i+})}{\alpha + n + m + 1},$$

(4.4) $$V(\theta_{j|i} \mid S) = \frac{\hat{\theta}_{j|i}(1 - \hat{\theta}_{j|i})}{\hat{\alpha}_{i+} + 1}.$$
Note that $V(\theta_{ij} \mid S)$ can also be written as $\hat{\theta}_{ij}^2 \left[ V(\theta_{i+} \mid S)/\hat{\theta}_{i+}^2 + V(\theta_{j|i} \mid S)/\hat{\theta}_{j|i}^2 + V(\theta_{i+} \mid S)\, V(\theta_{j|i} \mid S)/(\hat{\theta}_{i+}^2 \hat{\theta}_{j|i}^2) \right]$. The marginal probability of Y = j is $\theta_{+j} = \sum_{i=1}^r \theta_{ij} = \sum_{i=1}^r \theta_{i+} \theta_{j|i}$ and, when data are IM, $\theta_{j|i}$ and $\theta_{j|h}$ are dependent. If we assume that $\theta_{j|i}$ and $\theta_{j|h}$ are uncorrelated, the posterior variance of $\theta_{+j}$ can be approximated by

$$V(\theta_{+j} \mid S) \approx \sum_{i=1}^r V(\theta_{ij} \mid S) + \sum_{i \ne h} \hat{\theta}_{j|i} \hat{\theta}_{j|h}\, \mathrm{Cov}(\theta_{i+}, \theta_{h+} \mid S),$$

with

$$\mathrm{Cov}(\theta_{i+}, \theta_{h+} \mid S) = -\frac{\hat{\theta}_{i+} \hat{\theta}_{h+}}{\alpha + n + m + 1}.$$
Thus, we have an approximation of the posterior variance of $\theta_{+j}$ in terms of the BC estimates of $\theta_{j|i}$. The approximate variance is the exact variance when data are MAR while, in other cases, the accuracy of the approximation depends on the magnitude of the correlation between $\theta_{j|i}$ and $\theta_{j|h}$. If the probabilities of nonresponse are independent for different categories of X, then the accuracy of the approximation depends only on the accuracy of the BC estimates. Simulation studies have shown that the approximation is very accurate for large samples. Examples are given in Section 5.
The final task is to approximate the marginal posterior distributions of $\theta_{ij}$ and $\theta_{+j}$. Consider first $\theta_{ij}$. A simple choice is to set $\theta_{ij} \mid S, \phi \sim D(\hat{\alpha}_{ij1}, \hat{\alpha}_{ij2})$, where $\hat{\alpha}_{ij1}$ and $\hat{\alpha}_{ij2}$ are chosen to match the marginal mean and variance of $\theta_{ij}$, and hence

$$\hat{\alpha}_{ij1} = \hat{\theta}_{ij}\, \hat{\alpha}_{ij} \quad \text{and} \quad \hat{\alpha}_{ij2} = (1 - \hat{\theta}_{ij})\, \hat{\alpha}_{ij}.$$

Under this choice, the posterior precision of $\theta_{ij}$ is

$$\hat{\alpha}_{ij} = \frac{\hat{\theta}_{ij}(1 - \hat{\theta}_{ij})}{V(\theta_{ij} \mid S)} - 1.$$
It can be shown that $d\hat{\alpha}_{ij}/d\hat{\theta}_{j|i}$ is positive if and only if $\hat{\theta}_{i+}(\hat{\alpha} + 1) \ge \hat{\alpha}_{i+} + 1$, and this condition is always satisfied. Thus, the posterior precision is an increasing function of $\hat{\theta}_{j|i}$, and hence of $\phi_{j|i}$. This is reasonable since, for fixed i, increasing $\phi_{j|i}$ has the effect of assigning an increasing number of the $m_i$ nonresponses to Y = j. Analogously, the marginal posterior distribution of $\theta_{+j}$ can be approximated by a $D(\hat{\alpha}_{+j1}, \hat{\alpha}_{+j2})$, with hyperparameters chosen to match (3.8) and $V(\theta_{+j} \mid S)$ given above. An alternative approximation, suitable for large samples, is to set $\theta_{ij} \mid S, \phi \sim N(\hat{\theta}_{ij}, V(\theta_{ij} \mid S))$. Under the MAR assumption, this is the asymptotic approximation of the posterior distribution based on the generalized MLEs of the parameters, if the prior hyperparameters were all increased by 1 (Berger 1985).
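The Beta moment matching used here is generic: given a target mean and variance, the implied precision and hyperparameters follow directly. A small sketch (our code, not the article's):

```python
def beta_match(mean, var):
    """Match a Beta(a1, a2) distribution to a given mean and variance.

    The implied precision is a = mean*(1 - mean)/var - 1, and the
    hyperparameters are a1 = mean*a, a2 = (1 - mean)*a.
    """
    a = mean * (1.0 - mean) / var - 1.0
    if a <= 0:
        raise ValueError("variance too large for a Beta with this mean")
    return mean * a, (1.0 - mean) * a
```

By construction the matched Beta reproduces the target mean and variance exactly, since a Beta with precision a has variance mean*(1 - mean)/(a + 1); quantiles of the matched distribution then give approximate credible intervals.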
This section illustrates the features of BC using the polling data examined by Forster and Smith (1998). We give a description of the dataset, and then we compute bounds on the estimates consistent with the data available and collapse them into three sets of estimates by assuming three nonresponse models. The goal of this section is twofold: to show how to present the inference provided by BC and to evaluate, empirically, the accuracy of the BC estimates and the moment-matching approximation of Section 4 by comparing them with estimates and approximations obtained via MCMC and imputation methods.
The task of the analysis is to provide some insight into the result of the political elections using data extracted from the British General Election Panel Survey (Table 2). The frequencies are classified according to Gender ($X_1$), taking values 1 = Male and 2 = Female; Social Class ($X_2$), coded as 1 = Professional, 2 = Managerial and Technical, 3 = Skilled, 4 = Semiskilled and Unskilled, 5 = Never Worked; and Voting Intention (Y), coded as 1 = Conservative, 2 = Labour, 3 = Liberal Democrat, 4 = Other. The total sample size is 1,242 cases, of which 375 do not record the Voting Intention. Following Forster and Smith (1998), we assume that Gender and Social Class are associated, and that they both affect Voting Intention. We denote $p(Y = j \mid X_1 = i, X_2 = h, \theta)$ by $\theta_{j|ih}$ and assume a Perks prior distribution $\theta \sim D(\alpha)$, with $\alpha_{ihj} = 1/40$ (Good 1968), so that the total prior precision is 1. From this specification, we can derive $\theta_{IH+} \sim D(\alpha_{IH+})$ with $\alpha_{ih+} = 1/10$, from which $\theta_{I++} \sim D(\alpha_{I++})$ with $\alpha_{i++} = 1/2$, and $\theta_{+H+} \sim D(\alpha_{+H+})$ with $\alpha_{+h+} = 1/5$. Furthermore, $\theta_{J|ih} \sim D(\alpha_{J|ih})$ with $\alpha_{j|ih} = 1/40$.
5.2 Bound and Collapse
Table 3 reports lower and upper bounds of the estimates of the conditional probabilities computed using (3.3), when no model for nonresponse is assumed. The last column is the width of the probability intervals of each conditional distribution. The last two rows give the lower and upper bounds on the estimates of the marginal probabilities of Y, computed as in (3.4). Once bounds have been estimated, the collapse step can be used to model different assumptions about the pattern of missing data.
5.2.1 Missing at Random
Suppose that data are MAR, so that the BC estimates of $\theta_{j|ih}$ are the exact ones given in (2.2). From these quantities, we estimate the joint probabilities $\theta_{ihj}$ and their exact standard errors, as well as the marginal probabilities $\theta_{++j}$ of Voting Intention and the relative standard errors, using the approximation described in Section 4.
Table 4 reports estimates, standard errors, and 95% credible intervals for the marginal probabilities $\theta_{++j}$. The credible intervals were computed using the moment-matching approximation of Section 4. To evaluate the goodness of this approximation, we generated a sample of 5,000 observations from the posterior distribution of $\theta_{++J}$ using the Gibbs sampler implemented in Bugs5 (Thomas, Spiegelhalter, and Gilks 1992), after a burn-in of 1,000 observations. Figure 1 reports nonparametric estimates of the marginal posterior densities obtained from the sample generated by Bugs5 (dashed lines), and the densities obtained using the moment-matching approximation given in Section 4 (continuous lines). The density estimates were computed using the density estimation method implemented in S-Plus 4.0. The accuracy of the approximation is evident, and it is further confirmed by the equivalence of the estimates and credible intervals computed by BC and by Bugs5 (Table 4). Credible intervals computed for the joint probabilities are also very similar to those found by Bugs5, even when the frequencies of complete cases are small. For instance, the estimates of the joint probabilities $\theta_{151}$, $\theta_{152}$, $\theta_{153}$, and $\theta_{154}$ are .007, .007, .002, and .00003; the 95% credible intervals computed using the moment-matching approximation described in Section 4 are (.003; .013), (.003; .013), (.0003; .007), and (.00; .005). The credible intervals computed with Bugs5 are (.003; .013), (.003; .013), (.0003; .007), and (.00; .003). The only significant difference between the two methods was the time spent computing the estimates: on a 120 MHz Sparc Ultra, Bugs5 took more than 2 minutes, while our implementation of BC ran to completion in 96 milliseconds.
As pointed out by a referee, however, faster stochastic calculations can be carried out using direct Monte Carlo methods; with these data, such methods yield essentially the same results at a lower computational cost (seconds rather than minutes).
5.2.2 Informative Nonresponse
It is believed that "nonrespondents to polls before the 1992 British General Election were more heavily pro-Conservative than respondents" (Forster and Smith 1998). Here we investigate two particular assumptions on the missing data mechanism and we compare the BC estimates with imputation-based ones. We first assume that nonrespondents are truly uncertain among the three major parties, and we model this assumption by setting
     j               1    2    3    4
     phi[sub j|ih]  .32  .32  .32  .04     (5.1)
for all h, i. Thus, we assume a uniform pattern of nonresponse across the categories of X[sub 1] and X[sub 2]. BC estimates of the conditional probabilities are easily computed by mixing the upper and lower bounds given in Table 3. From these estimates, we compute approximate posterior variances of theta[sub j|ih] by using (4.1) with k = 1. Estimates of the marginal probabilities theta[sub ++j], standard errors, and 95% credible intervals--based on the moment-matching approximation to a Dirichlet distribution--are reported in Table 5. In this analysis, we used the total number of complete and incomplete cases to approximate the posterior precision of theta[sub j|ih], so that the 95% credible intervals are smaller than those obtained under the MAR assumption. To avoid overstating the precision, a smaller proportion of the incomplete cases can be considered, as suggested in Section 4.
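For a single cell of X, the bound and collapse computations just described reduce to one exact Bayesian updating and one convex combination. A minimal sketch follows (this is an illustration, not the authors' code; the symmetric Dirichlet prior of alpha = 0.025 per category is an assumption, chosen because it approximately reproduces the bounds printed in Table 3):

```python
def bound_collapse(counts, missing, phi, alpha=0.025):
    """Sketch of bound-and-collapse estimates for one cell of X.

    counts  -- complete-case counts n_j over the categories of Y
    missing -- number m of nonrespondents in the cell
    phi     -- assumed nonresponse probabilities phi_j (summing to 1)
    alpha   -- symmetric Dirichlet prior per category (an assumption)

    The bound step brackets each posterior mean between the estimates
    obtained when none / all of the m missing cases fall in category j;
    the collapse step mixes the bounds through phi.
    """
    total = sum(counts) + missing + len(counts) * alpha
    lower = [(n + alpha) / total for n in counts]
    upper = [(n + alpha + missing) / total for n in counts]
    width = missing / total
    collapsed = [lo + p * width for lo, p in zip(lower, phi)]
    return lower, upper, collapsed
```

For the professional-males cell (complete-case counts 26, 8, 7, 0 and m = 11 in the data table), this gives bounds of about (.4995, .7107) for Conservative, and the uniform nonresponse pattern (5.1) collapses them to roughly .567.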
The accuracy of the BC estimates was compared to imputation-based estimates obtained by completing the incomplete sample 1,000 times, generating the missing entries from the nonresponse probabilities phi. In each completed sample, the exact estimates of the conditional and marginal probabilities theta[sub j|ih] and theta[sub ++j] were computed using the standard conjugate analysis described in Section 2. Final estimates and standard errors were then taken as the means and standard errors of the 1,000 estimates generated by the simulation. Empirical 95% credible intervals were computed using the 2.5% and 97.5% quantiles (Table 5). This comparison reveals the high accuracy of the BC estimates and of the moment-matching approximation proposed in Section 4. The accuracy of the approximation is further confirmed in Figure 2, which plots the approximate posterior densities of the parameters theta[sub ++j] obtained by moment-matching (continuous lines) and nonparametric estimates of the marginal posterior densities of theta[sub ++j] obtained from the sample generated via imputation (dashed lines).
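The imputation benchmark can be sketched as follows (again an illustrative reimplementation, not the original code): each replicate completes the sample by drawing the m missing responses from phi, runs the conjugate Dirichlet analysis on the completed sample, and the replicate posterior means are then averaged. The prior value alpha = 0.025 per category is the same assumption as above.

```python
import random

def imputation_estimates(counts, missing, phi, n_sims=1000, alpha=0.025, seed=0):
    """Monte Carlo imputation benchmark (a sketch): complete the sample
    n_sims times by drawing the m missing responses from phi, compute
    the conjugate Dirichlet posterior means on each completed sample,
    and average across replicates."""
    rng = random.Random(seed)
    k = len(counts)
    acc = [0.0] * k
    for _ in range(n_sims):
        filled = list(counts)
        for cat in rng.choices(range(k), weights=phi, k=missing):
            filled[cat] += 1
        total = sum(filled) + k * alpha
        for j in range(k):
            acc[j] += (filled[j] + alpha) / total
    return [a / n_sims for a in acc]
```

On the professional-males cell with the nonresponse pattern (5.1), the averaged imputation estimate for Conservative settles near the value obtained by mixing the bounds directly, which is the agreement between the two methods that the text describes.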
Consider now another pattern of nonresponse implementing the Silent Conservative effect (Butler and Kavanagh 1992)--that is, the assumption that nonrespondents are pro-Conservative. We assume a uniform pattern of nonresponse across categories of X[sub 1] and X[sub 2] and we set
     j               1    2    3    4
     phi[sub j|ih]  .41  .28  .28  .03     (5.2)
for all h, i. The effect on the marginal probability of Voting Intention is shown in Table 6, and Figure 3 plots the approximate posterior densities of the parameters theta[sub ++j] (continuous lines: moment-matching approximation; dashed lines: imputation-based nonparametric density estimates). The results again confirm the accuracy of the BC estimates and of the moment-matching approximation.
5.3 DISCUSSION
A summary of the inference produced with BC is displayed in Figure 4, which reports the probability intervals corresponding to the 10 combinations of X[sub 1] and X[sub 2] categories, together with the point estimates computed in the collapse step under the three missing data mechanisms. Continuous lines refer to males, while dotted lines refer to females. Stars represent point estimates computed in the collapse step under the MAR assumption. Points denoted by g and s represent BC estimates conditional on the two informative patterns of nonresponse given in Equations (5.1) and (5.2), respectively.
The width of the probability intervals gives a measure of the amount of uncertainty among nonrespondents and of the effect of the variables Gender and Social Class on the amount of nonresponse. Intervals are tighter for males than for females (mean width is .2629 for males and .3405 for females). Semiskilled and unskilled males (X[sub 2] = 4) are the least uncertain with a clear Labour preference. Most uncertain are males who never worked (X[sub 2] = 5). Among females, semiskilled and unskilled are the most uncertain, followed by professional females, while females who never worked are the least uncertain. Therefore, if we accept that data are either MAR or MCAR, bounds would suggest that data are MAR rather than MCAR. An open question is whether it is possible to use the intervals' width to derive some statistical test to discriminate between the MAR and MCAR assumption.
The amount of overlap among intervals can help to single out the categories to be addressed more effectively during the political campaign. For example, the uncertainty due to nonresponse has little effect on the Tory prevalence among professional males or on the Labour prevalence among semiskilled males, as shown by the nonoverlapping intervals. On the other hand, it is impossible to predict whether managerial (X[sub 2] = 2), skilled (X[sub 2] = 3), and never-worked (X[sub 2] = 5) males will vote Conservative or Labour unless we make some assumption about the pattern of nonresponse. The MAR assumption leads to predicting a clear Conservative prevalence among managerial males, while the Conservative advantage is reduced under the two informative nonresponse models. Similarly, the advantage of Labour over Conservative among skilled males, predicted under the MAR assumption, becomes negligible if nonresponse is intentional. The voting intention of females in all social classes is an open question, as shown by the overlapping intervals. Again, enforcing the MAR assumption results in predicting a strong advantage of the Conservative party among managerial and skilled females, and a Labour advantage among semiskilled and never-worked females. Changing the assumption on the pattern of nonresponse significantly weakens these differences.
Figure 5 plots BC estimates of the marginal probability of Voting Intention. The uncertainty on the marginal probabilities shows, essentially, that the data do not provide enough evidence to predict the victory of one party. If we assume that nonresponse is intentional, the result of the election could have been a surprise, as a victory of the Labour party, or even of the Liberal Democrats, could not be excluded, although the Tories seemed to be ahead. In the follow-up survey reported by Forster and Smith (1998), a sample of 1,242 individuals were asked which party they had voted for; 21 did not respond, 86 claimed not to have voted, and, of the remaining 1,135, 44.1% voted Conservative, 32.2% voted Labour, 21.0% voted Liberal Democrat, and 2.82% voted for other parties. Thus, the MAR assumption would have produced an overestimation of the preferences for the Conservative and Labour parties and an underestimation of the preferences for the Liberal Democrat party, while the Silent Conservative effect would have produced more accurate figures.
Three main objectives were set at the beginning of this article: the definition of an estimation method from incomplete samples robust with respect to the pattern of missing data, the identification of reliable measures able to account for the presence of missing data in a sample, and the development of efficient computational methods to perform these calculations. BC provides a methodological framework within which these goals can be achieved.
The basic intuition behind BC is that information about the incomplete sample and exogenous knowledge about the pattern of missing data should be kept separate. This assumption naturally produces a two-step method: bound and collapse. The bound step extracts from the sample all the available information and returns a set of possible estimates consistent with the incomplete sample. This step provides, as a by-product, a measure of the reliability of the estimates with respect to the amount of information actually conveyed by the incomplete sample about each parameter of interest. The second step of BC uses exogenous information available on the missing data mechanism to select single estimates within the sets defined by the first step. These estimates are weighted averages of estimates computed from the complete sample and the probabilities of nonresponse. When data are MAR, BC returns the exact Bayesian estimates. Under a generic missing data mechanism, BC estimates are the expected Bayesian estimates given the nonresponse model phi.
BC provides only estimates of the posterior means of multinomial data with Dirichlet priors. Credible intervals can be computed using the moment-matching approximation proposed in Section 4. The results of Section 5 have shown the accuracy of the approximation in three examples. An open question is to evaluate--through extensive empirical work--the general accuracy of the approximation.
BC is an estimation procedure able to encode different assumptions about the pattern of missing data and to quickly evaluate the sensitivity of the estimates to different assumptions about the nonresponse model. Marginal inference can be obtained by averaging estimates obtained under different nonresponse models. From a computational point of view, BC provides a deterministic method that reduces the cost of estimating the conditional and marginal distributions of Y to the cost of one exact Bayesian updating and one convex combination for each category of Y within each category of X. The computational complexity of BC is therefore independent of the number of missing data and, being deterministic, BC avoids the convergence-rate and convergence-detection problems that afflict the iterative and stochastic methods currently used for the analysis of incomplete samples.
Simplicity and efficiency make this method a powerful tool for the analysis of incomplete multinomial data, fostering the application of principled statistical methods to real-world problems.
This research was supported by the ESPRIT programme of the Commission of the European Community under contract EP29105. We thank two anonymous referees, the associate editor, and the editor for their useful comments that helped to improve the original manuscript.
        Y = 1      ...  Y = c      R
X = 1   n[sub 11]  ...  n[sub 1c]  m[sub 1]
 ...    ...        ...  ...        ...
X = r   n[sub r1]  ...  n[sub rc]  m[sub r]
X[sub 1]  X[sub 2]   Y=1  Y=2  Y=3  Y=4    m
   1         1        26    8    7    0    11
   1         2        87   37   30    6    64
   1         3        66   77   23    8    77
   1         4        14   25   15    1    12
   1         5         6    6    2    0     7
   2         1         1    1    1    0     1
   2         2        63   34   32    2    68
   2         3       102   52   22    4    77
   2         4        10   32   10    2    38
   2         5        20   25    8    2    19
Voting Intention (Y): 1 = Conservative, 2 = Labour, 3 = Liberal Democrat, 4 = Other; Gender (X[sub 1]): 1 = Male, 2 = Female; Social Class (X[sub 2]): 1 = Professional, 2 = Managerial and Technical, 3 = Skilled, 4 = Semiskilled and Unskilled, 5 = Never Worked.
X[sub 1]  X[sub 2]          Y=1    Y=2    Y=3    Y=4    Width
   1         1     lower   .4995  .1540  .1348  .0005  .2112
                   upper   .7107  .3652  .3460  .2116
   1         2     lower   .3883  .1652  .1340  .0269  .2856
                   upper   .6739  .4508  .4196  .3125
   1         3     lower   .2629  .3068  .0917  .0320  .3067
                   upper   .5696  .6134  .3983  .3386
   1         4     lower   .2090  .3730  .2239  .0153  .1789
                   upper   .3879  .5518  .4028  .1941
   1         5     lower   .2855  .2855  .0960  .0012  .3318
                   upper   .6173  .6173  .4277  .3329
   2         1     lower   .2010  .2010  .0049  .2010  .3921
                   upper   .5931  .5931  .3971  .5931
   2         2     lower   .3165  .1709  .1608  .0102  .3416
                   upper   .6581  .5124  .5024  .3517
   2         3     lower   .3968  .2024  .0857  .0157  .2995
                   upper   .6963  .5018  .3852  .3151
   2         4     lower   .1088  .3477  .1088  .0220  .4125
                   upper   .5214  .7603  .5214  .4346
   2         5     lower   .2702  .3377  .1083  .0273  .2567
                   upper   .5267  .5941  .3647  .2837
theta[sub +j]      lower   .3180  .2391  .1201  .0211  .3017
                   upper   .6197  .5408  .4218  .3228
                           Y=1             Y=2             Y=3             Y=4
BC    theta[sub ++j]      .4531           .3446           .1717           .0306
      s.e.                .0167           .0162           .0128           .0058
      95% CI         (.4182;.4881)   (.3121;.3779)   (.1470;.1980)   (.0202;.0431)
MCMC  theta[sub ++j]      .4527           .3447           .1718           .0307
      s.e.                .0168           .0157           .0127           .0059
      95% CI         (.4206;.4860)   (.3141;.3755)   (.1476;.1973)   (.0202;.0431)
                                 Y=1             Y=2             Y=3             Y=4
BC          theta[sub ++j]      .4145           .3357           .2166           .0332
            s.e.                .0083           .0075           .0057           .0021
            95% CI         (.3983;.4308)   (.3211;.3505)   (.2055;.2279)   (.0292;.0374)
Imputation  theta[sub ++j]      .4145           .3358           .2165           .0332
            s.e.                .0074           .0071           .0072           .0031
            95% CI         (.4000;.4290)   (.3220;.3494)   (.2029;.2303)   (.0276;.0396)
                                 Y=1             Y=2             Y=3             Y=4
BC          theta[sub ++j]      .4417           .3236           .2045           .0302
            s.e.                .0086           .0073           .0055           .0020
            95% CI         (.4248;.4586)   (.3094;.3380)   (.1938;.2154)   (.0264;.0340)
Imputation  theta[sub ++j]      .4408           .3233           .2045           .0304
            s.e.                .0075           .0069           .0069           .0027
            95% CI         (.4266;.4556)   (.3107;.3365)   (.1909;.2182)   (.0251;.0360)
GRAPH: Figure 1. Comparisons of Marginal Posterior Density Functions of Voting Intention (BC: continuous lines; Gibbs sampling: dashed lines) When Data are MAR.
GRAPH: Figure 2. Comparisons of Marginal Posterior Density Functions of Voting Intention (BC: continuous lines; imputation: dashed lines) When Data are Assumed to be IM and the Probabilities of Nonresponse are Those Given in (5.1) for all i and h.
GRAPH: Figure 3. Comparisons of Marginal Posterior Density Functions of Voting Intention (BC: continuous lines; imputation: dashed lines) When Data are Assumed to be IM and the Probability of Nonresponse is phi[sub j|ih] = .41, .28, .28, .03 for all i, h.
GRAPH: Figure 4. BC estimates of the probability of Voting Intention (C = Conservative; L = Labour; LD = Liberal Democrat; O = Other) conditional on the gender of the respondent (continuous lines: Males; dotted lines: Females) and the five social classes. Stars represent point estimates computed in the collapse step, under the MAR assumption. Points denoted by g and s represent BC estimates conditional on two informative patterns of nonresponse in (5.1) and (5.2).
GRAPH: Figure 5. BC Estimates of the Marginal Probability of Voting Intention (C = Conservative; L = Labour; LD = Liberal Democrat; O = Other). Stars represent point estimates computed in the collapse step, under the MAR assumption. Points denoted by g and s represent BC estimates conditional on two informative patterns of nonresponse in (5.1) and (5.2).
Berger, J. O. (1985), Statistical Decision Theory and Bayesian Analysis (2nd ed), New York: Springer-Verlag.
Bernardo, J. M., and Smith, A. F. M. (1994), Bayesian Theory, New York: Wiley.
Butler, D., and Kavanagh, D. (1992), The British General Election of 1992, New York: St. Martin's Press.
Copas, J. B., and Li, H. G. (1997), "Inference for Non-random Samples" (with discussion), Journal of the Royal Statistical Society, Ser. B, 59, 55-95.
Cowell, R. G., Dawid, A. P., and Sebastiani, P. (1996), "A Comparison of Sequential Learning Methods for Incomplete Data," in Bayesian Statistics 5, Oxford: Oxford University Press, pp. 533-542.
Dempster, A. P. (1968), "A Generalization of Bayesian Inference," Journal of the Royal Statistical Society, Ser. B, 30, 205-247.
Dickey, J. M., Jiang, J. M., and Kadane, J. B. (1987), "Bayesian Methods for Censored Categorical Data," Journal of the American Statistical Association, 82, 773-781.
Fang, K. T., Kotz, S., and Ng, K. W. (1990), Symmetric Multivariate and Related Distributions, London: Chapman and Hall.
Forster, J. J., and Smith, P. W. F. (1998), "Model-Based Inference for Categorical Survey Data Subject to Non-ignorable Non-response" (with discussion), Journal of the Royal Statistical Society, Ser. B, 60, 57-70.
Forster, J. J., McDonald, J. W., and Smith, P. W. F. (1996), "Monte Carlo Exact Conditional Tests for Log-Linear and Logistic Models," Journal of the Royal Statistical Society, Ser. B, 58, 445-453.
Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (1995), Bayesian Data Analysis, London: Chapman and Hall.
Geman, S., and Geman, D. (1984), "Stochastic Relaxation, Gibbs Distributions and the Bayesian Restoration of Images," IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721-741.
Good, I. J. (1962), "Subjective Probability as a Measure of a Non-Measurable Set," in Logic, Methodology and Philosophy of Science, eds. E. Nagel, P. Suppes, and A. Tarsky, Stanford, CA: Stanford University Press, pp. 319-329.
-----(1968), The Estimation of Probability: An Essay on Modern Bayesian Methods, Cambridge, MA: MIT Press.
Jiang, J. M., Kadane, J. B., and Dickey, J. M. (1992), "Computation of Carlson's Multiple Hypergeometric Function R," Journal of Computational and Graphical Statistics, 1, 231-251.
Kadane, J. B. (1993), "Subjective Bayesian Analysis for Surveys With Missing Data," The Statistician, 42, 415-426.
Kadane, J. B., and Terrin, N. (1997), "Missing Data in the Forensic Context," Journal of the Royal Statistical Society, Ser. A, 160, 351-357.
Kyburg, H. E. (1961), Probability and the Logic of Rational Belief, Middletown: Wesleyan University Press.
Little, R. J. A., and Rubin, D. B. (1987), Statistical Analysis with Missing Data, New York: Wiley.
Paulino, M. D. C., and deB. Pereira, C. A. (1992), "Bayesian Analysis of Categorical Data Informatively Censored," Communications in Statistics, Part A--Theory and Methods, 21, 2689-2705.
-----(1995), "Bayesian Analysis of Categorical Data Under Informative General Censoring," Biometrika, 82, 439-446.
Ross, S. M. (1996), Stochastic Processes, New York: Wiley.
Rubin, D. B. (1976), "Inference and Missing Data," Biometrika, 63, 581-592.
Schafer, J. L. (1997), Analysis of Incomplete Multivariate Data, London: Chapman and Hall.
Smith, P. W. F., Forster, J. J., and McDonald, J. W. (1996), "Monte Carlo Exact Tests for Square Contingency Tables," Journal of the Royal Statistical Society, Ser. A, 159, 309-321.
Spiegelhalter, D. J., and Cowell, R. G. (1992), "Learning in Probabilistic Expert Systems," in Bayesian Statistics 4, Oxford: Clarendon Press, pp. 447-466.
Spiegelhalter, D. J., and Lauritzen, S. L. (1990), "Sequential Updating of Conditional Probabilities on Directed Graphical Structures," Networks, 20, 579-605.
Tanner, M. A. (1996), Tools for Statistical Inference (3rd ed.), New York: Springer Verlag.
Tanner, M. A., and Wong, W. H. (1987), "The Calculation of Posterior Distributions by Data Augmentation" (with discussion), Journal of the American Statistical Association, 82, 528-550.
Thomas, A., Spiegelhalter, D. J., and Gilks, W. R. (1992), "Bugs: A Program to Perform Bayesian Inference Using Gibbs Sampling," in Bayesian Statistics 4, Oxford: Clarendon Press, pp. 837-842.
Titterington, D. M., Smith, A. F. M., and Makov, U. E. (1985), Statistical Analysis of Finite Mixture Distributions, Chichester: Wiley.
Walley, P. (1991), Statistical Reasoning with Imprecise Probabilities, London: Chapman and Hall.
Wilks, S. S. (1963), Mathematical Statistics, New York: Wiley.
[Received February 1999. Revised July 1999.]
~~~~~~~~
By Paola Sebastiani and Marco Ramoni
Paola Sebastiani is Assistant Professor, Department of Mathematics and
Statistics, University of Massachusetts, Lederle Graduate Research Tower, Box
34515, Amherst, MA 01003 (E-mail: sebas@math.umass.edu). Marco Ramoni is
Instructor, Hospital Informatics Program, Harvard Medical School, 350 Longwood
Avenue, Boston, MA 02115 (E-mail: marco_ramoni@harvard.edu).
Title: | Asymmetric Shocks among U.S. States. |
Subject(s): | |
Source: | |
Author(s): | |
Abstract: | Shows the application of a factor model to the study of risk sharing among states in the United States. Disentangle movements in output and consumption due to national, regional, or state-specific business cycles from those due to measurement error; Substantial amount of interstate risk sharing due to the presence of measurement error in output; International business cycles. |
AN: | 3915983 |
Database: | Business Source Premier |
Abstract: This paper applies a factor model to the study of risk sharing among U.S. states. The factor model makes it possible to disentangle movements in output and consumption due to national, regional, or state-specific business cycles from those due to measurement error. The results of the paper suggest that some findings of the previous literature which indicate a substantial amount of interstate risk sharing may be due to the presence of measurement error in output. When measurement error is properly taken into account, the evidence points towards a lack of interstate smoothing.
JEL classification: E20, E32, F36
Key words: intranational business cycles, risk sharing, factor models
I Introduction
Are intranational business cycles different from international business cycles? Is there more risk sharing within a country or among countries? The trend towards trade and capital market integration observed in the past twenty years makes these questions very relevant for international macroeconomics. Indeed, the study of intranational business cycles may shed light on the future patterns of international co-movements, assuming that such a trend will continue. As a result, a growing body of literature has investigated these questions since the beginning of the nineties.[1]
The policy implications of this literature are far reaching. If risk sharing is one of the beneficial effects of a global capital market, opening internal capital markets to foreign capital may increase macroeconomic stability in the long run. For Europe in particular, the comparison between an established monetary union (United States) and a nascent one (EMU) is often used as a tool to judge the likelihood of success of the latter.[2]
Hess and Shin (1998) provide an interesting study of intranational business cycles within the United States.[3] Using data on retail sales of non-durables for nineteen U.S. states from 1978 to 1992, they show that the so-called "quantity anomaly," i.e., the finding that de-trended consumption is less correlated across countries than output (see Backus, Kehoe, and Kydland 1992), holds true at the intranational level as well. This result is interpreted as evidence of a lack of risk sharing among U.S. states, since under perfect risk sharing consumption should be perfectly correlated across states.
The results of Hess and Shin are starkly at odds with those obtained by Asdrubali, Sorensen, and Yosha (1996). In an influential paper, Asdrubali et al. find that the amount of risk sharing among U.S. states is substantial: about 75% of output shocks are smoothed via either capital and credit markets or the federal government. Their finding was later confirmed by the studies of Crucini (1998) and Melitz and Zumer (1999). While Hess and Shin and Asdrubali et al. reach opposite conclusions on the degree of interstate risk sharing, their results are not directly comparable. Hess and Shin's data set includes only nineteen states, while the study of Asdrubali et al. includes all fifty states. Several of the thirty-one states not included in Hess and Shin's analysis, being oil-producing or agricultural states, are generally subject to more risk than the nineteen considered by Hess and Shin. Perhaps more importantly, Hess and Shin's results are based on the study of cross-state correlations in consumption and output. The study of cross correlations has a serious limitation, especially when applied to state-level data: given that consumption data are likely to be measured with error, cross-state consumption correlations may be low for reasons other than a lack of risk sharing.
This paper contributes to the study of intranational risk sharing in the United States in two ways. The first contribution consists in expanding Hess and Shin's data set both cross-sectionally and in the time dimension. Using a different source of data for consumption of non-durables, the paper can reproduce Hess and Shin's finding for all fifty states from 1969 to 1995. The second contribution consists in the application of a factor model to the study of risk sharing. The factor model makes it possible to disentangle movements in the data due to shocks in "true" consumption and output from those purely due to measurement error. The paper finds that when measurement error in the data is taken into account, the "quantity anomaly" still holds for U.S. states, contradicting the conclusions of Asdrubali et al. In essence, the results of this paper indicate that some of the smoothing found by Asdrubali et al. may simply be the shedding of measurement error in output, and not actual risk sharing.
The remainder of the paper is organized as follows. Section 2 discusses the model, Section 3 illustrates the data, Section 4 describes the findings of the paper, and Section 5 concludes.
2 The model
This section describes the factor model used to analyze fluctuations in relative per capita output and consumption at the state level. By definition, a change in relative per capita output or consumption in a given state implies that per capita output or consumption in that particular state does not move in synchrony with aggregate per capita output or consumption, so I will refer to these changes as "asymmetric shocks". The factor model considered here differs from the standard factor model in the restrictions imposed in order to identify the model. The restrictions are as follows: for each state, changes in relative output and consumption are assumed to depend on a nation-wide shock (the U.S. business cycle), on a regional shock (the regional business cycle), and on a state-specific shock (the state-specific business cycle). The identification restrictions therefore consist of a set of zero restrictions on the matrix of coefficients: the impact of a regional business cycle shock in a given region is constrained to be zero for states that do not belong to that region, and the impact of a state-specific shock on the consumption or output of other states is also zero by assumption.[4]
The model can be described as follows. Let the variables c[sub it] and y[sub it] represent de-trended and de-meaned relative per capita consumption and output for state i in period t (i = 1, ..., n; t = 1, ..., T). Specifically, if C[sub it] and C[sup us][sub t] denote per capita consumption in state i and in the US, respectively, c[sub it] represents the quantity log(C[sub it]) - log(C[sup us][sub t]), de-trended and de-meaned.[5] The variable c[sub it] is referred to as an "asymmetric shock" in consumption, as it measures the extent to which log(C[sub it]) and log(C[sup us][sub t]) do not move in unison. The same definition applies to y[sub it]. If state i belongs to region r (r = 1, ..., R), c[sub it] and y[sub it] are affected by the nation-wide shock f[sup us][sub t], by the regional shock f[sup r][sub t], and by the state-specific shock f[sup i][sub t]. In addition, consumption is affected by a purely idiosyncratic shock e[sub it], which reflects preference shocks and/or measurement error. Relative consumption and output in each state can be affected differently by national, regional, and state-specific shocks, i.e., the exposures are not constrained to be the same. Formally, the model is as follows:
y[sub it] = beta[sup us][sub yi] f[sup us][sub t] + beta[sup r][sub yi] f[sup r][sub t] + beta[sup i][sub yi] f[sup i][sub t]
c[sub it] = beta[sup us][sub ci] f[sup us][sub t] + beta[sup r][sub ci] f[sup r][sub t] + beta[sup i][sub ci] f[sup i][sub t] + e[sub it] (1)
where the betas denote the exposures of relative consumption and output in state i to the different factors (national - us, regional - r, etc.), and where the identifying restrictions on the exposures are as follows:
beta[sup r][sub yi] = beta[sup r][sub ci] = 0 if state i does not belong to region r
beta[sup j][sub yi] = beta[sup j][sub ci] = 0 for all j not equal to i. (2)
The factors are, by construction, uncorrelated with each other and with the idiosyncratic shocks, and the idiosyncratic shocks are also uncorrelated with each other:
E(f[sup us][sub t] f[sup r][sub t]) = E(f[sup us][sub t] f[sup i][sub t]) = E(f[sup r][sub t] f[sup i][sub t]) = E(f[sup r][sub t] f[sup s][sub t]) = E(f[sup i][sub t] f[sup j][sub t]) = E(e[sub it] e[sub jt]) = 0
E(f[sup us][sub t] e[sub it]) = E(f[sup r][sub t] e[sub it]) = E(f[sup i][sub t] e[sub it]) = 0 (3)
for all t, all r, all i, all s not equal to r, all j not equal to i. It is also assumed, as in the standard factor model, that all variables are normally distributed. In particular, for all t, r, and i,
f[sup us][sub t] ~ N(0, 1), f[sup r][sub t] ~ N(0, 1), f[sup i][sub t] ~ N(0, 1), e[sub it] ~ N(0, sigma[sup 2][sub i]) (4)
The assumption that factors have unitary variance is purely a normalization: the different variances of the U.S., regional, and state-specific business cycles are reflected in the different magnitudes of the beta parameters.
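To make the normalization concrete, here is a small simulation of model (1) for one state. All loading values are hypothetical; the factors are drawn as independent standard normals, so the betas alone determine the size of the asymmetric shocks.

```python
import random

def simulate_state(beta_y, beta_c, sigma_e, T, seed=0):
    """One state under model (1): each period, draw the national,
    regional, and state-specific factors as independent N(0, 1)
    variates (the unit variance is the normalization in the text),
    form relative output as a loading-weighted sum of the factors,
    and add an extra idiosyncratic N(0, sigma_e) shock to relative
    consumption only."""
    rng = random.Random(seed)
    ys, cs = [], []
    for _ in range(T):
        factors = [rng.gauss(0.0, 1.0) for _ in range(3)]  # f_us, f_r, f_i
        ys.append(sum(b * f for b, f in zip(beta_y, factors)))
        cs.append(sum(b * f for b, f in zip(beta_c, factors)) + rng.gauss(0.0, sigma_e))
    return ys, cs
```

With hypothetical output loadings (0.6, 0.3, 0.2), the variance of simulated relative output is close to 0.36 + 0.09 + 0.04 = 0.49, since independent unit-variance factors contribute their squared loadings.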
Model (1) allows for measurement error in consumption but not in output. The state-specific factor indeed coincides with the shock to output in state i, after taking into account national and regional business cycle shocks. This assumption is made because, with only two observations for each state (consumption and output), it is not possible to separately identify state-specific and idiosyncratic shocks in output. The assumption of no measurement error in output is implicitly made in some of the previous literature as well (Asdrubali et al. and Melitz and Zumer): as these authors use output as a regressor, the presence of substantial measurement error in output would imply that their estimates are biased. While this assumption is generally thought to be reasonable, as output is likely to be better measured than consumption, the results in Section 4 show that it may not be correct.
When two different measures for consumption and output are available, the factor model can be used to quantify the amount of measurement error attributable to both output and consumption. Let us call y[sup 1][sub it] and y[sup 2][sub it], and c[sup 1][sub it] and c[sup 2][sub it], the two available measures of relative output and consumption in state i at time t, respectively. Shocks to y[sup 1][sub it] and y[sup 2][sub it] (c[sup 1][sub it] and c[sup 2][sub it]) are the result of shocks in the "true" measure of relative output (consumption), which is unobservable, and of measurement error. Formally:
y[sup 1][sub it] = y*[sub it] + epsilon[sup y1][sub it], y[sup 2][sub it] = y*[sub it] + epsilon[sup y2][sub it]
c[sup 1][sub it] = c*[sub it] + epsilon[sup c1][sub it], c[sup 2][sub it] = c*[sub it] + epsilon[sup c2][sub it] (5)
where y*[sub it] and c*[sub it] are the "true" measures of output and consumption, respectively. As in model (1), I assume that movements to y*[sub it] and c*[sub it] can be attributed to national, regional, and state-specific shocks, as well as preference shocks in the case of consumption. Under these assumptions, model (1) can be extended as follows:
y[sup k][sub it] = beta[sup us][sub yi] f[sup us][sub t] + beta[sup r][sub yi] f[sup r][sub t] + beta[sup i][sub yi] f[sup i][sub t] + gamma[sub yi] f[sup yi][sub t] + epsilon[sup yk][sub it], k = 1, 2
c[sup k][sub it] = beta[sup us][sub ci] f[sup us][sub t] + beta[sup r][sub ci] f[sup r][sub t] + beta[sup i][sub ci] f[sup i][sub t] + gamma[sub ci] f[sup ci][sub t] + epsilon[sup ck][sub it], k = 1, 2 (6)
where the identifying restrictions on the betas, as well as the distributional assumptions, are the same as in model (1). The term gamma[sub ci]f[sup ci][sub t] allows for the presence of preference shocks in consumption, and for the possibility that the measurement errors in the consumption data are correlated. The term gamma[sub yi]f[sup yi][sub t] allows for the possibility that the measurement errors in the output data are correlated. The relative accuracy of the different data sets can be assessed by comparing the standard deviations of the idiosyncratic errors, epsilon[sup y1][sub it] and epsilon[sup y2][sub it] for output, and epsilon[sup c1][sub it] and epsilon[sup c2][sub it] for consumption.
As discussed in the next section, two different measures of non-durable consumption are not available for all fifty states. Two measures of output are, however, available. If we modify model (6) as follows:
y[sup k][sub it] = beta[sup us][sub yi] f[sup us][sub t] + beta[sup r][sub yi] f[sup r][sub t] + beta[sup i][sub yi] f[sup i][sub t] + gamma[sub yi] f[sup yi][sub t] + epsilon[sup yk][sub it], k = 1, 2
c[sub it] = beta[sup us][sub ci] f[sup us][sub t] + beta[sup r][sub ci] f[sup r][sub t] + beta[sup i][sub ci] f[sup i][sub t] + e[sub it] (7)
we obtain a model that still allows for measurement error in both consumption and output, and can be estimated for all fifty states. The only substantial difference between models (6) and (7) is that the second model will not provide any information on the importance of preference shocks in consumption, as opposed to pure measurement error. The appendix describes the details of the maximum likelihood estimation of models (1), (6), and (7).[6]
From any of the three factor models described above, it follows that the standard deviation of asymmetric shocks to "true" output and consumption in state i can be expressed as:
$$
\sigma^{y^{*}}_{i} = \sqrt{(\beta^{US}_{yi})^{2} + (\beta^{r}_{yi})^{2} + (\beta^{i}_{yi})^{2}}, \qquad
\sigma^{c^{*}}_{i} = \sqrt{(\beta^{US}_{ci})^{2} + (\beta^{r}_{ci})^{2} + (\beta^{i}_{ci})^{2}},
\tag{8}
$$
where, in order to simplify the notation, we denote by r the region to which state i belongs. Asymmetric shocks may arise from differences among states in the exposure to the U.S. business cycle, as well as from regional and state-specific business cycles.[7]
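As a minimal sketch, the quantity in (8) can be computed from estimated loadings as follows; the function name and the numeric loadings are illustrative, and the factors are assumed to be mutually independent with unit variance:

```python
import numpy as np

def asymmetric_sd(beta_us, beta_region, beta_state):
    # Standard deviation of asymmetric shocks to a state's "true" series:
    # with independent unit-variance factors, the variances contributed by
    # the national, regional, and state-specific loadings simply add up.
    return np.sqrt(beta_us**2 + beta_region**2 + beta_state**2)

# Illustrative loadings for one state's output equation:
sd_y = asymmetric_sd(0.010, 0.008, 0.006)
```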
The approach developed in this section presents a number of advantages over simple correlations as a tool for studying co-movements among output and consumption across states. First of all, this methodology can quantify the size of asymmetric shocks. The correlation between two variables only conveys information on the extent to which the shocks affecting the variables are orthogonal to each other, but says nothing of the magnitude of the shocks. More importantly, the factor model makes it possible to disentangle asymmetries in consumption and output due to "true" asymmetric shocks, as opposed to asymmetries due to measurement error or preference shocks.[8]
The approach adopted here also presents an advantage over that followed by Asdrubali et al. and Melitz and Zumer. In essence, these authors assess the amount of interstate risk sharing via a panel regression of relative consumption on relative output. Their approach takes fully into account measurement error in consumption, the regressand, but not in output, the regressor. Models (6) and (7) have the advantage that they can allow for measurement error in both consumption and output. In addition, the decomposition of consumption and output fluctuations into national, regional, and state specific business cycles can be helpful in identifying the sources of imperfect risk sharing.
Two important assumptions underlie the model. The first assumption is that the model is not dynamic, in that the factors are assumed to be uncorrelated over time. Several papers (see Stockman 1988, and Costello 1993) neglect serial correlation when dealing with annual data. Since the model is applied to relative output and consumption, serial correlation in the data is even less of a problem, as discussed in the next section.[9] The second assumption is that the parameters are not time-varying. As the productive structure of the states has changed over time, so have in principle the exposures to national, regional, and state-specific business cycles. The limited time series dimension of the sample (at most 26 observations) makes it hard to deal with this issue. Therefore, I follow most previous work, and do not allow for time-varying parameters.
The literature on risk sharing among U.S. states has used different data sets for state consumption and output. As discussed in the introduction, the literature has reached different conclusions on the extent to which U.S. states share risk. In order to assess whether this is the result of differences in the data sets, as opposed to differences in the methodology, this paper uses four data sets, which are described in table 1.
The first data set, data set 1 (HS), is essentially the same one used by Hess and Shin. As a measure of real output, Hess and Shin use data on real gross state product (gsp) from the Bureau of Economic Analysis (BEA). The real gsp data are available only since 1977. They are obtained by deflating nominal gsp by a gsp deflator. The latter is a weighted average of national producer prices, where the weight of each commodity is given by its production share in each state. In terms of the consumption data, Hess and Shin argue that evidence on the "quantity anomaly" should be obtained using data on consumption of non-durables, as this variable is a better empirical counterpart to the theoretical definition of consumption used in Backus et al. (1992) than total consumption, which includes consumption of durables. As a measure of non-durable consumption, Hess and Shin use retail sales of non-durables from the Bureau of the Census. These data are available only from 1978 to 1995 for nineteen states, and are no longer produced. In order to obtain real consumption, Hess and Shin deflate the data using the gsp deflator from the BEA. This paper does not follow their choice for two reasons. First, the prices used in the deflator are national prices, and do not take into account price differences within the United States.[10] Second, the share of any particular commodity in production is likely to be different from the share in consumption, particularly for oil-producing and agricultural states. Instead, nominal consumption data are here converted into real terms using state CPI data, which are described below. This difference does not affect the results in terms of cross-state correlations in consumption and output, as discussed in the next section. In spite of this minor difference, I will refer to this data set as the HS data set.
The second data set, data set 2 (ASY), is the one used by Asdrubali et al., who use as output measure the nominal gsp from the BEA. Consumption is measured as the sum of total private consumption and state and local government consumption. Total private consumption by state is obtained by multiplying total retail sales by state by the ratio of total private U.S. consumption to total U.S. retail sales for the corresponding year. Total retail sales by state are obtained from Sales&Marketing Management (the data are proprietary, and I am grateful to Sales&Marketing Management for permission to use them). State and local government consumption is constructed following the definition given in their paper. Asdrubali et al. assume that there are no differences in either gsp or consumption deflator among U.S. states. Under this assumption, the presence of fixed time effects in their estimation procedure implies that the regressions can be run using nominal data, as the common deflator would be washed out by the fixed time effects. This is also the case in the factor model used here, since it applies to relative consumption and output. When computing cross-state correlations in consumption and output, however, I deflate nominal consumption by the U.S. CPI, obtained from the Bureau of Labor Statistics, and nominal gsp by the U.S. gdp deflator, obtained from the BEA. The time period used by Asdrubali et al. is 1963-1990. For comparison with the other data sets I use the time period 1969-1995. The data are available for all 50 states.
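The scaling just described is a single ratio adjustment. As a sketch with purely hypothetical figures:

```python
# State private consumption is proxied by scaling state retail sales with
# the national consumption-to-retail-sales ratio for the same year.
# All figures below are hypothetical.
us_private_consumption = 4_800.0  # $bn, illustrative
us_retail_sales = 2_400.0         # $bn, illustrative
state_retail_sales = 50.0         # $bn, illustrative

scale = us_private_consumption / us_retail_sales
state_private_consumption = state_retail_sales * scale
```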
The third data set, data set 3, extends the Hess and Shin data set both cross-sectionally and in the time dimension. As in data set 1 (HS), nominal consumption is measured as non-durable retail sales. Since the Census data are available for nineteen states only, and only from 1978, we use Sales&Marketing Management data, which are available for all 50 states from the 1930s. Non-durable retail sales are constructed as the difference between total retail sales and retail sales of automobiles, furniture, building materials and hardware.[11] Nominal output is measured as nominal gsp from the BEA. Both consumption and output are deflated using state CPI data. The CPI series are constructed using American Chamber of Commerce Association data on Cost of Living by metropolitan areas, as well as other sources, and are weighted for each state using BEA data on population by metropolitan area. The CPI data are constructed from 1969 to 1995, which implies that the real output and consumption series are available for this period only (details on the construction of the CPI series can be found in Del Negro 1998b). When the consumption and output series are deflated using the U.S. CPI and the U.S. gdp deflator, respectively, instead of the state CPI, I obtain very similar results, which I do not report.
Since it may not be appropriate to deflate the nominal output series by the CPI, given that output and consumption baskets differ, I also present the results for a fourth data set, data set 4, in which real output is measured as real gsp from the BEA (same source as in data set 1). Real consumption is measured as in data set 3, which makes it possible to include all 50 states in the data set. This data set covers a shorter time span than data set 3 (1978-1995), given the constraint on the availability of real gsp data.
In all data sets retail sales are used as a proxy for consumption, given that no data on state-level consumption are available. Retail sales are an imprecise measure of consumption, both because they do not incorporate consumption of services, and because they may include purchases made by residents of other states.[12] However, Hess and Shin show that at the aggregate level Census retail sales are a good proxy for consumption, especially at the annual frequency.
In all data sets, the data are transformed in per capita terms using the population data from the BEA. The definition of regions used in the factor model follows the BEA. The BEA regions are New England, Mid East, Great Lakes, Plains, South East, South West, Rocky Mountains, and Far West.
As mentioned in the previous section, the factor model adopted here ignores serial correlation in the data. Table 2 provides a justification for this assumption, as it shows that the average first order serial correlation in relative consumption is nil, and is significantly different from zero at the 5% level for less than 12% of all states. The autocorrelation in relative output is higher, but with the exception of data set 2 (ASY) it is still significantly different from zero only for less than 26% of all states. For data set 2 (ASY) the first order serial correlation coefficient is significant for, at most, 40% of all states. Some authors (for instance Stockman 1988) overcome the issue of serial correlation by running a regression using the residuals from an AR process. Given the fact that the AR coefficients are imprecisely estimated, I chose not to do so, as this procedure may introduce considerable measurement error.
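The first-order serial correlation check reported in table 2 can be sketched as follows; this is a minimal OLS-based version with an asymptotic standard error, and all names and series are illustrative:

```python
import numpy as np

def ar1_coefficient(x):
    # OLS estimate of the first-order serial correlation of a series
    # (regression of x_t on x_{t-1} after demeaning), together with an
    # asymptotic standard error for a two-sided significance test.
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    y, z = x[1:], x[:-1]
    rho = (z @ y) / (z @ z)
    resid = y - rho * z
    se = np.sqrt((resid @ resid) / (len(y) - 1) / (z @ z))
    return rho, se

# Illustrative use on a simulated series:
rng = np.random.default_rng(0)
rho, se = ar1_coefficient(rng.standard_normal(200))
significant = abs(rho / se) > 1.96  # 5% two-sided test
```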
The data are de-trended using two different methods: log-differences (growth rates) and Hodrick-Prescott (HP) filtering. Given that the frequency of the data is annual, in applying the HP filter the smoothing parameter is set to 10 (see Baxter and King 1999).
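Both detrending methods can be sketched directly. The HP filter below uses the closed-form solution tau = (I + lambda D'D)^{-1} y, with the smoothing parameter set to 10 as in the text; the function names are illustrative:

```python
import numpy as np

def log_diff(y):
    # Growth rates: first differences of the log of the series.
    y = np.asarray(y, dtype=float)
    return np.diff(np.log(y))

def hp_cycle(y, lamb=10.0):
    # Hodrick-Prescott cyclical component.  The trend solves
    # (I + lamb * D'D) tau = y, where D is the (T-2) x T second-difference
    # matrix; lamb = 10 is the annual-data setting used in the text.
    y = np.asarray(y, dtype=float)
    T = len(y)
    D = np.zeros((T - 2, T))
    for i in range(T - 2):
        D[i, i:i + 3] = [1.0, -2.0, 1.0]
    trend = np.linalg.solve(np.eye(T) + lamb * (D.T @ D), y)
    return y - trend
```

A series with a purely linear trend has a zero cyclical component under this filter, which is a quick sanity check.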
4 The results
The discussion of the results begins with an analysis of cross-correlations for U.S. states, and continues with the description of the results of the factor models.
Table 3 displays the average cross-state correlation of consumption and output, the difference between the two, and the percentage of observations for which the correlation in output is larger than the correlation in consumption. The results are displayed for all four data sets described in the previous section, and for both detrending methods. Fig. 1 displays the cross-correlations of output and consumption across U.S. states for all four data sets. Specifically, the plots in Fig. 1 display all the pairs (Corr(y[sub it], y[sub jt]), Corr(c[sub it], c[sub jt])), with output correlations on the horizontal axis, and consumption correlations on the vertical axis. While Fig. 1 shows the results for the log-differenced data only, the results for the HP-detrended data are very similar, as can be seen from table 3.
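The statistics behind table 3 and Fig. 1 can be sketched as follows, assuming consumption and output panels arranged as T x N arrays (all names are illustrative):

```python
import numpy as np

def pairwise_corr(panel):
    # Upper-triangle cross-sectional correlations of a (T x N) panel:
    # one correlation per pair of states.
    c = np.corrcoef(panel, rowvar=False)
    iu = np.triu_indices_from(c, k=1)
    return c[iu]

def share_below_45(y_panel, c_panel):
    # Fraction of state pairs whose consumption correlation lies below
    # their output correlation, i.e. below the 45 degree line in Fig. 1.
    ry = pairwise_corr(y_panel)
    rc = pairwise_corr(c_panel)
    return np.mean(rc < ry)
```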
The main message from table 3 and from Fig. 1 is that the "quantity anomaly" holds for U.S. states regardless of the data set. For data set 1 (HS), 94% of the observations are below the 45 degree line, implying that for almost all states consumption correlations are lower than output correlations. The average consumption correlation is between .3 and .33, and the average output correlation is between .7 and .78, depending on the de-trending method. These figures are roughly the same ones reported by Hess and Shin. For the other data sets the evidence in favor of the "quantity anomaly" is not as stark, as the introduction of thirty-one more states in the sample, many of which are agricultural and oil producing, causes a decrease in output correlations. Yet for all data sets the correlation in output is higher than the correlation in consumption for at least three quarters of the observations: the "quantity anomaly" holds for U.S. states as well as for countries. However, if low consumption correlations are due to preference shocks or measurement error, no meaningful inference can be made about risk sharing. The models described in section 2 provide a better tool to analyze the data, as they separate out measurement error in consumption and output.
Table 4 analyzes the standard deviations of asymmetric shocks in consumption and output for all data sets and detrending methods, obtained from model (1). In particular, the table shows the average standard deviations of asymmetric shocks in consumption and output across states, the difference between the two, and the percentage of observations for which the standard deviation of asymmetric shocks in consumption is larger than the standard deviation of asymmetric shocks in output, considering all states, and considering only those states for which the difference is significantly different from zero at the 5% level (in parentheses). For each data set and detrending method, the table displays the results including (first line) and excluding (second line) idiosyncratic shocks in consumption. Note that asymmetric shocks to consumption, when purged of measurement error and/or preference shocks, are by construction related to shocks in output. One can then gauge the amount of inter-state smoothing of output shocks by comparing the standard deviations of asymmetric shocks in output and consumption. It is important to bear in mind that the results shown in table 4 are obtained assuming that output is measured without error.
Three facts emerge from table 4. The first fact is that whenever the idiosyncratic component of consumption is included, the analysis based on correlation and the analysis based on standard deviations of asymmetric shocks deliver the same result: output is more correlated across states (less asymmetric) than consumption. For all data sets and detrending methods the standard deviation of asymmetric shocks in output is less than the standard deviation of asymmetric shocks in consumption for at least 86% of states.
The second fact is that taking idiosyncratic shocks into account makes a substantial difference. The average difference between the asymmetric standard deviation in consumption and output is halved for data set 1 (HS), more than halved for data set 4, and is completely reversed for data sets 2 (ASY) and 3. The effect of excluding the idiosyncratic component of consumption can be appreciated graphically in Fig. 2, which plots the pairs of standard deviations of asymmetric shocks for non-durable consumption and output for all four data sets, with (left column) and without (right column) the idiosyncratic component of consumption (Fig. 2 focuses on log-differenced data; the plots for HP-filtered data are similar). For all the observations that lie below the 45 degree line the standard deviation of asymmetric shocks in consumption is larger than the standard deviation of asymmetric shocks in output. The starred observations are those for which the two are different at the 5% significance level. For data sets 2 (ASY) and 3 the effect of excluding the idiosyncratic component is very evident: while in the plots in the left column the vast majority of the observations lies to the right of the 45 degree line, in the corresponding plots in the right column most states (and most significant observations) flip over to the other side of the line.
The third fact from table 4 is that there are important differences among the data sets in terms of the implications for interstate risk sharing. For data sets 1 (HS) and 4 the "quantity anomaly" clearly holds, even after taking into account measurement error and preference shocks in consumption. For data set 1 (HS) the asymmetric standard deviation in output is smaller than the one in consumption for all observations for which the difference is significant. For data sets 2 (ASY) and 3, conversely, the asymmetric standard deviation in output is greater than the one in consumption for all observations for which the difference is significant (with the exception of HP-filtered data for data set 3). For data set 2 (ASY), in the case of log-differenced data (the one considered by Asdrubali et al.), about a third of output shocks are smoothed. This figure is not nearly as large as the 75% suggested by Asdrubali et al., perhaps because of the differences in the methodologies, but it suggests a considerable amount of smoothing. Under the other detrending method and data sets, the amount of smoothing is not as large.
What is driving these differences? There are three possible explanations for the divergence in the results: the number of states included in the data (nineteen versus fifty), the time period (1978-1995 versus 1969-1995), and the data sources. Differences in the number of states and in the time period explain some of the divergence in the results.[13] Much of this divergence, however, is due to differences in the data sources. When the model is estimated for data set 3 using the same time period and the same set of states as Hess and Shin, the degree of inter-state smoothing is found to be significant (the results are not shown for lack of space).
As the differences in the data sources appear to matter, one is left with the question of which data set is most reliable. Since model (1) allows for measurement error in consumption, it is unlikely that differences in the measurement of consumption are driving the results. Rather, one should focus on measurement error in output. In this regard, the data sets can be compared by looking at the standard deviation of y[sub it] for the same states and the same time period. For log-differenced data, the average standard deviation of y[sub it] for all fifty states in the 1978-1995 time period is 2.61%, 2.67%, and 2.23% for data sets 2 (ASY), 3, and 4, respectively. For HP-filtered data, the corresponding figures are 1.7%, 1.75%, and 1.37%. These numbers point to substantial differences in the measurement of output across data sets, and imply that one needs to address the issue of measurement error in output in order to properly estimate the amount of inter-state risk sharing. Models (6) and (7) can be used to address this issue. For the nineteen states for which the Census nondurable consumption measures are available, we can use model (6) to compare the Hess and Shin measures of output and consumption (data set 1) with those in data sets 2 (ASY) or 3.[14] Model (6) cannot be estimated for all 50 states, since only one measure of nondurable consumption is available, the one from Sales&Marketing Management. Using model (7) we can still compare the measure of output used by Hess and Shin, the real output from the BEA (data set 4), with the nominal BEA output figures, deflated using either the US gdp deflator (data set 2-ASY) or the state-level CPI (data set 3).
Table 5 shows that measurement error in output explains the differences in the results shown in table 4. Table 5 displays the average across states of the estimated standard deviations of asymmetric shocks to consumption and output, with and without measurement error.[15] Once measurement error in both consumption and output is taken into account, the results are consistent across data sets: for at least 78% of the states the standard deviation of asymmetric shocks in consumption is larger than the standard deviation of asymmetric shocks in output. Fig. 3 complements table 5 graphically: the figure plots the standard deviations of asymmetric shocks to consumption and output, with and without measurement error, for the log-differenced data only. For all data sets, the vast majority of observations lies below the 45 degree line, both before and after considering measurement error in the data. It is important to remark that the difference between the standard deviation of asymmetric shocks in consumption and output is significantly positive for only a handful of states. The evidence towards dis-smoothing is weak. At the same time, there is no evidence at all pointing towards smoothing of asymmetric shocks in output. In this sense, the results unequivocally point towards a lack of risk sharing among states.
A comparison of tables 4 and 5 reveals that the results from model (1) do not always coincide with those of models (6) and (7). The average standard deviation of "true" asymmetric shocks in consumption for data set 4 should in principle be the same in table 4 and in table 5. Yet in table 4 the estimated average standard deviation of asymmetric shocks in consumption for data set 4, without measurement error, is 2.55% for log-differenced data. The corresponding measure in table 5 is 3.23% when data set 4 is paired with data set 2 (ASY), or 3.3% when it is paired with data set 3. Similar differences arise for HP-filtered data. These differences are likely to be due to the very short sample (17 observations for each state), which implies that the parameters are imprecisely estimated. This problem does not arise for the other data sets: for data set 1, for instance, the average standard deviation of "true" asymmetric shocks in consumption is roughly the same in tables 4 and 5. Since the results in terms of risk sharing are robust across data sets in table 5, the differences between tables 4 and 5 may not be much of a concern.
Table 6 provides information on the source of asymmetric shocks. For each of the measures of consumption and output used in the estimation, table 6 displays the cross-sectional average of the estimated standard deviation of asymmetric shocks due to each factor (national, regional, state-specific, common measurement error, and idiosyncratic measurement error), as well as the percentage of states for which this is significantly different from zero at the 5% level.[16] In order to understand the sources of asymmetric shocks it is useful to focus on the estimates from data sets 2 (ASY) and 4, and 3 and 4, as they involve all fifty states. Estimates involving nineteen states only may not correctly identify the role of national and regional business cycles.
As far as output is concerned, national, regional, and state-specific shocks have roughly the same importance. On average, the standard deviation of asymmetric shocks due to national business cycles is about 1% for log-differenced data, and is significantly different from zero at the 5% level for almost half of the states. For HP-filtered data, its importance relative to regional and state-specific business cycles is slightly less, but it is still significantly different from zero for a third of the states. This finding implies that national business cycles have a significantly different impact across states, in contrast to the results obtained by Blanchard and Katz (1992) using employment data. Regional business cycles are as important as, or slightly more important than, national shocks, depending on the de-trending method, and are also significant for over 40% of states. Regional business cycles may reflect the geographical pattern of industry composition across states. The impact of state-specific shocks is as large as the impact of regional shocks in terms of magnitude, but is much more imprecisely estimated.

As far as consumption is concerned, state-specific shocks seem to play a more important role than either national or regional shocks. The impact of state-specific business cycles on consumption is more than twice as large as their impact on output, indicating that dis-smoothing with respect to state-specific shocks may be one of the major causes of the lack of risk sharing. In general, these coefficients are also very imprecisely estimated. If we focus only on those states for which the difference in the standard deviations of asymmetric shocks in consumption and output is significantly positive, the exposure of consumption to state-specific shocks is large and precisely estimated.
For those few states, however, the standard deviation of the idiosyncratic measurement error in consumption is estimated to be very small, which is surprising given that on average it is estimated to be large, above .87%. This suggests that for these states a large idiosyncratic movement in consumption may be mistaken for a state-specific shock.[17] In summary, the evidence suggesting that asymmetric shocks in consumption are larger than asymmetric shocks in output is questionable. At the same time, there is no evidence at all that points towards inter-state smoothing of asymmetric shocks in output, regardless of their source.
The figures in table 6 also help to explain why the different data sets used in table 4 sometimes lead to opposite conclusions in terms of the "quantity anomaly". The explanation lies in measurement error in output, which is sizable for data sets 2 (ASY) and 3. Model (1) allowed only for measurement error in consumption. When measurement error in consumption was taken into account in the computation of the standard deviation of asymmetric shocks, the standard deviation of asymmetric shocks in consumption decreased, but the standard deviation of asymmetric shocks in output by construction remained the same. For those data sets for which measurement error in output is relatively small, like data sets 1 and 4, the standard deviation of asymmetric shocks in consumption remained above that of output for most states. But for those data sets for which measurement error in output is large, like data sets 2 (ASY) and 3, eliminating measurement error only in consumption resulted in a reversal of the ranking.
Table 6 is also informative in regard to the magnitude of preference shocks. The term gamma[sub ci]f[sup ci][sub t] in model (6) represents both common measurement error in consumption and preference shocks. Under the assumption that measurement error and preference shocks are uncorrelated, the standard deviation of asymmetric shocks due to gamma[sub ci]f[sup ci][sub t] (the second-to-last column in table 6) represents an upper bound on the standard deviation of asymmetric shocks due to preference shocks. Table 6 shows that the standard deviation of asymmetric shocks due to the term gamma[sub ci]f[sup ci][sub t] is fairly small on average and not significantly different from zero in all but a few cases.
In conclusion, when measurement error in both consumption and output is properly taken into account, asymmetric shocks in consumption are as large as asymmetric shocks to output for all but a few states, pointing towards a lack of inter-state smoothing. This finding is consistent with that of Hess and Shin, who show that for the nineteen states included in their analysis the "quantity anomaly" appears to hold: output is more correlated across states than consumption. The analysis of Hess and Shin, which is based on cross correlations, is affected by the problem of measurement error. However, their approach treats measurement error in output and consumption symmetrically. In contrast, the approach of Asdrubali et al. takes full account of measurement error in consumption, but does not allow for measurement error in output. According to the results of this paper, this may be why Asdrubali et al. find substantial risk sharing and Hess and Shin find none.
5 Conclusions
The paper uses a factor model to analyze asymmetric shocks to consumption and output across U.S. states. The factor model represents a particularly useful tool for the analysis of state-level data as it makes it possible to disentangle movements in output and consumption due to national, regional, or state-specific business cycles from those due to measurement error. Given that measurement error is likely to be substantial for state level data, this approach has an edge over those used in the existing literature.
The results of the paper suggest that the findings of Asdrubali et al. (1996) and Melitz and Zumer (1999) indicating a substantial amount of inter-state risk sharing may be due to the presence of measurement error in output: part of the smoothing of output shocks found by those authors may simply represent shedding of measurement error, and not actual risk sharing. In general, the presence of measurement error in output implies that their methodology may not be appropriate, since the use of output as a regressor may yield biased estimates of the amount of smoothing.
When measurement error in both consumption and output is properly taken into account, asymmetric shocks in consumption are as large as asymmetric shocks in output for all but a few states, pointing towards a lack of inter-state smoothing.
These results open a number of questions for future research. First, this paper does not investigate the role of capital markets, credit markets, and the federal government in smoothing or dis-smoothing asymmetric shocks to output. It would be interesting to repeat the analysis performed in the seminal paper of Asdrubali et al. taking into account measurement error in output. Second, the apparent lack of risk sharing at the state level is a puzzle that needs to be explained. Recent work by Coval and Moskowitz (1997), Huberman (1999), and Hess and Shin (1999) provides direct and indirect evidence of financial market segmentation within the United States. Yet, one would think that the degree of domestic financial market segmentation is much smaller than the segmentation at the international level. U.S. states share the same currency, language, laws, accounting standards, and federal government. The lack of risk sharing intranationally is even more puzzling than the lack of risk sharing internationally.
The author gratefully acknowledges financial support from ISFSE (CNR). The paper draws from the first chapter of the author's Ph.D. dissertation at Yale University, and he thanks his advisor Christopher A. Sims for invaluable help. Comments by Charles Engel and two anonymous referees greatly helped to improve the paper. Suggestions by Stefan Krieger, Jacques Melitz, Francesc Obiols-Homs, Christopher Otrok, Bent Sorensen, and seminar participants at the 1999 Winter Meetings of the Econometric Society, the 1999 Midwest Macroeconomic Conference, and the 1999 Royal Economic Society Conference are also acknowledged. The author is also grateful to Alejandro Ponce R. for excellent research assistance. The views expressed here are the author's and not necessarily those of the Federal Reserve Bank of Atlanta or the Federal Reserve System. Any remaining errors are the author's responsibility.
A Maximum likelihood estimation of the factor model
The appendix describes the details of the maximum likelihood estimation of the factor model. The factor model can be written in the general form:
x[sub t] = Bf[sub t] + epsilon[sub t], t = 1, ..., T (9)
where x[sub t] is an n x 1 vector of data at time t, f[sub t] is a k x 1 vector of factors, epsilon[sub t] is an n x 1 vector of idiosyncratic shocks, and B is an n x k matrix of parameters. Both the factors and the idiosyncratic shocks are normally distributed:
$$
f_{t} \sim N(0, I_{k}), \qquad \epsilon_{t} \sim N(0, \Phi),
$$
where Phi is a diagonal matrix. The model (9) encompasses models (1), (6), and (7). In order to check for the robustness of the results, I use two different approaches to the maximum likelihood problem, the EM algorithm and a Newton-Raphson routine.
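To make the mapping of models such as (7) into the general form (9) concrete, the sketch below lays out one plausible loading matrix for a single state, with zeros encoding identifying restrictions of the kind discussed in this appendix; the layout and the numeric values are purely illustrative, not the paper's exact specification:

```python
import numpy as np

# Loading matrix B for one state under a model like (7): rows are the
# observables (y1, y2, c), columns the factors (national, regional,
# state-specific, common output measurement error).  The zero encodes an
# identifying restriction: the output measurement-error factor does not
# enter the consumption equation.  All numeric values are illustrative.
b_y_us, b_y_r, b_y_i, g_y = 1.0, 0.8, 0.6, 0.4
b_c_us, b_c_r, b_c_i = 0.9, 0.5, 0.7

B = np.array([
    [b_y_us, b_y_r, b_y_i, g_y],   # y1: true output + common meas. error
    [b_y_us, b_y_r, b_y_i, g_y],   # y2: same true output and error factor
    [b_c_us, b_c_r, b_c_i, 0.0],   # c : no output measurement error
])
```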
The EM algorithm was first applied to factor models by Lehmann and Modest (1985). This is a generalization of their approach to the case in which a set of linear restrictions is applied to the matrix B. The EM algorithm follows the intuition that, if the factors were observable, all the parameters could be estimated by means of OLS. The algorithm is an iterative procedure that consists of two steps (see also Gelman et al. 1995). For a given value of the parameters obtained at the end of the q[sup th] iteration of the algorithm, that is, (B[sup q], Phi[sup q]), the first step involves taking the expectation (E) of the logarithm of the joint posterior distribution of B, Phi, and f, given the observations x [equivalent to] (x[sub 1], ..., x[sub T]), with respect to the conditional distribution of f given (B[sup q], Phi[sup q]) and x. The second step consists of maximizing (M) the resulting expression with respect to (B, Phi). Each iteration of the algorithm is bound to increase the likelihood, so that convergence to a (possibly local) maximum is guaranteed.
In the case of model (9), the first step results in the expression:
E[sub q][ln p(B, Phi, f | x)] = -T/2 (ln|Phi| + tr(Phi[sup -1]S)) + sum[sub t] x'[sub t]Phi[sup -1]B E[sub q][f[sub t]] - 1/2 sum[sub t] tr(B'Phi[sup -1]B E[sub q][f[sub t]f'[sub t]]) + const, (11)
where S is the variance-covariance matrix of the observations, and E[sub q][.] represents the expectation taken with respect to the conditional distribution of f given (B[sup q], Phi[sup q]) and x. The terms E[sub q][f[sub t]] and E[sub q][f[sub t]f'[sub t]] can be easily obtained from normal updating.
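In code, the normal-updating step can be sketched as follows. This is a hypothetical NumPy illustration (function and variable names are mine, not the author's implementation), computing E[sub q][f[sub t]] and E[sub q][f[sub t]f'[sub t]] for a single observation:

```python
import numpy as np

def e_step_moments(x_t, B, Phi):
    """Posterior moments of f_t given x_t under model (9).

    With f_t ~ N(0, I_k) and eps_t ~ N(0, Phi), normal updating gives
      E[f_t | x_t]   = B' V^{-1} x_t,   where V = B B' + Phi,
      Var[f_t | x_t] = I_k - B' V^{-1} B,
    so that E[f_t f_t' | x_t] = Var[f_t | x_t] + E[f_t|x_t] E[f_t|x_t]'.
    """
    k = B.shape[1]
    V = B @ B.T + Phi
    Vinv = np.linalg.inv(V)
    Ef = B.T @ Vinv @ x_t                  # E_q[f_t]
    Varf = np.eye(k) - B.T @ Vinv @ B      # posterior variance of f_t
    return Ef, Varf + np.outer(Ef, Ef)     # E_q[f_t], E_q[f_t f_t']
```

In the E step these moments would be accumulated over t = 1, ..., T and passed to the maximization step.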
Since the joint maximization of (11) with respect to (B, Phi) is complicated, I adopt a variant of the EM algorithm, known as ECM, in which the (M) step is split into a number of conditional maximization (CM) steps. The first CM step consists of maximizing (11) with respect to B, given Phi[sup q]. In order to implement the maximization one needs to deal with the linear restrictions (mostly zero restrictions) imposed on the matrix B. Let b be the p x 1 vector of unconstrained parameters, and M the nk x p matrix that maps b into vec(B'). Then the first CM step yields:
b[sup q+1] = [M'((Phi[sup q])[sup -1] ⊗ sum[sub t] E[sub q][f[sub t]f'[sub t]])M][sup -1] M'vec(sum[sub t] E[sub q][f[sub t]]x'[sub t](Phi[sup q])[sup -1]), (12)
with B[sup q+1] recovered from vec(B[sup q+1]') = Mb[sup q+1].
The last CM step delivers the estimate of Phi given B[sup q+1]:
Phi[sup q+1] = diag{(1/T) sum[sub t] E[sub q][(x[sub t] - B[sup q+1]f[sub t])(x[sub t] - B[sup q+1]f[sub t])']}.
The implementation of the Newton-Raphson routine uses the Matlab program csminwel obtained from Chris Sims. The likelihood function can be written as:
L(B, Phi | x) = -T/2(ln |V| + tr(V[sup -1]S)), (13)
where V [equivalent to] BB' + Phi is the theoretical covariance matrix. The implementation of the Newton-Raphson routine is greatly enhanced in terms of both speed and precision by the computation of the analytical gradient. The gradient with respect to b is -TM'vec((V[sup -1] - V[sup -1]SV[sup -1])B), and the gradient with respect to the diagonal elements of Phi is -T/2diag(V[sup -1] - V[sup -1]SV[sup -1]).
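The gradient formulas can be verified numerically. The sketch below is a NumPy illustration with made-up dimensions (standing in for the Matlab routine, and treating B as unrestricted so that the restriction matrix M drops out); it checks the analytical gradient with respect to B against central finite differences:

```python
import numpy as np

def log_lik(B, Phi, S, T):
    """Log-likelihood (13), up to an additive constant: -T/2 (ln|V| + tr(V^{-1} S))."""
    V = B @ B.T + Phi
    _, logdet = np.linalg.slogdet(V)
    return -T / 2 * (logdet + np.trace(np.linalg.solve(V, S)))

def grad_B(B, Phi, S, T):
    """Analytic gradient of (13) w.r.t. B: -T (V^{-1} - V^{-1} S V^{-1}) B."""
    V = B @ B.T + Phi
    Vinv = np.linalg.inv(V)
    return -T * (Vinv - Vinv @ S @ Vinv) @ B

rng = np.random.default_rng(0)
n, k, T = 4, 2, 25
B = rng.standard_normal((n, k))
Phi = np.diag(rng.uniform(0.5, 1.5, n))
A = rng.standard_normal((n, n))
S = A @ A.T / n + np.eye(n)                # a positive definite "sample" covariance

G = grad_B(B, Phi, S, T)
eps = 1e-6
for i in range(n):
    for j in range(k):
        E = np.zeros((n, k)); E[i, j] = eps
        num = (log_lik(B + E, Phi, S, T) - log_lik(B - E, Phi, S, T)) / (2 * eps)
        assert abs(num - G[i, j]) < 1e-4   # analytic and numerical gradients agree
```

Note that the agreement of this gradient with finite differences requires the plus sign inside (13); it would fail with a minus sign.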
The criteria for algorithm convergence were: i) the incremental change in the log-likelihood had to be less than 10[sup -10], and ii) the sum of the squared gradients had to be less than 10[sup -4]. Although neither algorithm guarantees convergence to a global maximum, both algorithms were run from several different starting points. The use of two algorithms also provides an additional robustness check. The variance-covariance matrix of the parameters is computed as the inverse of the Hessian of the likelihood at the peak; the Hessian is obtained via numerical differentiation of the gradient. The delta method was used to perform tests on non-linear functions of the parameters, such as the standard deviations of asymmetric shocks.
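For the unrestricted case (M equal to the identity, i.e. no zero restrictions on B), the full ECM iteration with convergence criterion (i) can be sketched as follows. This is a minimal NumPy illustration of the classical factor-analysis EM updates under that simplifying assumption, not the paper's constrained Matlab routine:

```python
import numpy as np

def em_factor(X, k, tol_lik=1e-10, max_iter=2000, seed=0):
    """EM for x_t = B f_t + eps_t with f_t ~ N(0, I_k), eps_t ~ N(0, diag(phi)).

    Sketch of the unrestricted case; the paper's constrained CM step
    additionally projects through the restriction matrix M.  Iteration
    stops when the incremental change in the log-likelihood falls below
    tol_lik, mirroring convergence criterion (i) in the text.
    X is a T x n matrix of demeaned observations.
    """
    T, n = X.shape
    S = X.T @ X / T                              # sample covariance
    rng = np.random.default_rng(seed)
    B = 0.1 * rng.standard_normal((n, k))
    phi = np.diag(S).copy()
    last = -np.inf
    for _ in range(max_iter):
        V = B @ B.T + np.diag(phi)
        W = B.T @ np.linalg.inv(V)               # k x n normal-updating matrix
        Ef = X @ W.T                             # rows: E_q[f_t]
        Sff = T * (np.eye(k) - W @ B) + Ef.T @ Ef   # sum_t E_q[f_t f_t']
        Sxf = X.T @ Ef                           # sum_t x_t E_q[f_t]'
        B = Sxf @ np.linalg.inv(Sff)             # CM step for B (unrestricted)
        phi = np.maximum(np.diag(S - B @ Sxf.T / T), 1e-8)  # CM step for phi
        V = B @ B.T + np.diag(phi)
        _, logdet = np.linalg.slogdet(V)
        lik = -T / 2 * (logdet + np.trace(np.linalg.solve(V, S)))
        if lik - last < tol_lik:                 # criterion (i): likelihood increment
            break
        last = lik
    return B, phi, lik
```

Each pass through the loop performs one E step followed by the two CM steps, so the likelihood is non-decreasing across iterations.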
Please address questions regarding content to Marco Del Negro, Federal Reserve Bank of Atlanta, Research Department, 104 Marietta Street N.W., Atlanta, Georgia 30303-2713, 404-521-8561, marco.delnegro@atl.frb.gov.
The full text of Federal Reserve Bank of Atlanta working papers, including revised versions, is available on the Atlanta Fed's Web site at http://www.frbatlanta.org/publica/work_papers/index.html. To receive notification about new papers, please use the on-line publications order form, or contact the Public Affairs Department, Federal Reserve Bank of Atlanta, 104 Marietta Street, N.W., Atlanta, Georgia 30303-2713, 404-521-8020.
Notes
1 See Asdrubali et al. (1996), Athanasoulis and van Wincoop (1997a, 1997b), Atkeson and Bayoumi (1993), Bayoumi and Klein (1995), Crucini (1998), Crucini and Hess (1999), Del Negro (1998a), Hess and Shin (1997, 1998), Melitz and Zumer (1999), Sorensen and Yosha (1997, 1998), van Wincoop (1995).
2 See Eichengreen (1990) among several others, and Frankel and Rose (1997) for a critique of this literature.
3 A number of papers (e.g. Clark, 1998; Ghosh and Wolf, 1997; Kollmann, 1995; Norrbin and Schlagenhauf, 1988) study the relative importance of regional and sectoral shocks within U.S. states or regions. These studies focus on fluctuations in employment, industrial production, and productivity. This literature is reviewed in Clark and Shin (1999). Carlino and Sill (1998) and Wynne and Koo (1999) analyze co-movements across U.S. regions in per capita income and output, respectively.
4 In standard factor models identification is obtained by means of the so called canonical restrictions, which have no economic content. The set of restrictions adopted here, based on geographic proximity, may not necessarily be the best one: one can think of grouping different states on the basis of different criteria, such as the productive structure, or the level of income (see Sorensen and Yosha 1997). However, geographic proximity may in some cases be a proxy for some of these features, as geography plays an important role in defining the productive structure of a given area (see Krugman 1991).
5 As discussed in the remainder of the paper, the de-trending methods used here are first differences and the HP filter. Since under both methods a linear operator is applied to the log of the data, it does not matter whether one first de-trends the log of per capita state consumption (output) and the log of per capita U.S. consumption (output) and then takes the difference between the two, or vice versa.
6 The approach adopted in this paper also differs from the one followed by Stockman (1988), who estimates a similar model using dummy variables. While this model is more complicated than Stockman's in terms of implementation, it has the advantage of being more economical in terms of the number of parameters that need to be estimated. Stockman uses a dummy variable for each factor in each time period: since there are 59 factors and 26 time periods, that would have meant estimating 26 x 59 = 1,534 parameters instead of 400. More importantly, Stockman's method does not allow for the same factor to have a different impact on different states, or a different impact on consumption and output in the same state. This would have been a major impediment in addressing questions regarding the relative variability of consumption and output, and the fraction of variability that can be attributed to each factor.
7 In some of the literature (for example Von Hagen and Hammond 1995) differences in the exposure to the U.S. business cycle are not included in the definition of asymmetric shocks. We chose to include them because if we ignored this component our definition of asymmetric shocks would not be consistent with the definition of perfect risk-sharing used in the literature. For states to be perfectly sharing risk it must be that: i) state or region-specific shocks do not affect their consumption, ii) common shocks affect all states in the same way. The latter requirement suggests that differences in the exposures to U.S. business cycles should be included in the definition of asymmetric shocks.
8 It is well known that one implication of complete markets when agents' preferences display constant relative risk aversion is the following (see for instance Obstfeld 1994):
c[sub it] = c[sup us][sub t] + theta[sub it]
where c[sub it] and c[sup us][sub t] represent the growth rates of consumption in state i and in the aggregate, respectively, and the term theta[sub it] represents preference shocks and/or measurement error. Due to the presence of theta[sub it] the correlation in consumption growth rates between states i and j may be less than one even under perfect risk sharing (see also Stockman and Tesar 1995).
9 Kose et al. (1999) and Forni and Reichlin (1998) estimate a dynamic factor model with annual data. Another example of estimation of a dynamic factor model is Gregory et al. (1997).
10 See Friedenberg and Beemiller (1997).
11 Regarding the quality of Sales&Marketing Management retail sales data, in Del Negro (1998b) I compare the Census and the Sales&Marketing Management non-durable retail sales data for the nineteen states for which both are available. In particular, I use a bivariate factor model of the form:
c[sup S&M][sub it] = c*[sub it] + theta[sup S&M][sub it]
c[sup C][sub it] = c*[sub it] + theta[sup C][sub it]
where c[sup S&M][sub it] and c[sup C][sub it] are the proxies for de-trended consumption of non-durables in state i obtained from Sales&Marketing Management and the Census, respectively, c*[sub it] represents the "true" measure of consumption, which is not observed, and theta[sup S&M][sub it] and theta[sup C][sub it] represent the measurement error for each proxy. The comparison between the standard deviations of theta[sup S&M][sub it] and theta[sup C][sub it] reveals that for the majority of states the Census measures are more precise. However, for none of the nineteen states am I able to reject the hypothesis that the two standard deviations are equal at the 10% significance level.
12 The District of Columbia is not included in the analysis precisely for this reason.
13 The difference between including nineteen or fifty states in the analysis can be appreciated by comparing the results for data sets 1 (HS) and 4 (the output measure and the time period are the same) shown in table 4 and Fig. 2. Using the shorter (1978-1995) versus the longer (1969-1995) time span also makes some difference, as can be seen by estimating the model for data set 3 using the shorter period (the results are not shown for lack of space).
14 Asdrubali et al. uses total consumption (both private and public) as a measure of consumption. All other data sets use non durable consumption. In order for the comparison between data sets to make sense, when estimating model (6) we replace their measure of consumption with the Sales&Marketing Management measure of non durable consumption. The results are however qualitatively the same when we use their measure of consumption.
15 For models (6) and (7) the estimated standard deviation (that is, the standard deviation computed using the estimated parameters) and the actual standard deviation do not always coincide. In principle, the maximum likelihood estimates of the parameters phi[sup i] (the variances of idiosyncratic shocks) should equalize the actual and the estimated standard deviations. Due to numerical problems, this is not always the case for models (6) and (7), no matter how stringent the convergence criteria for the likelihood and for the gradient are (see appendix). This is not an issue for model (1), or for the consumption data in model (7), suggesting that the problem may arise from the presence of both common and idiosyncratic measurement error. This problem generally affects only the more volatile of the two output or consumption series. In essence, this numerical issue implies that the idiosyncratic standard deviation for the most volatile data is either under- or overestimated. For the less volatile measure of either consumption or output the actual and the estimated standard deviations roughly coincide, and these are the figures shown in table 5 and figure 3 for the case with measurement error. In the case without measurement error the standard deviation of asymmetric shocks is the same across data sets by construction.
16 The figures shown in table 6 are the averages across states of the absolute values of the coefficients beta[sup us][sub yi], beta[sup r][sub yi], et cetera. It is important to note that these figures do not add up to the overall estimated standard deviation of asymmetric shocks, as the latter is not a linear function of the coefficients.
17 This may be the case if a large idiosyncratic movement in consumption and a small movement in output are coincidental. In fact, for these states the exposure of output on state-specific shocks is small and insignificant.
Data set 1 (HS): 19 states, 1978-95
  nominal consumption: retail sales (non durables); source: Census
  consumption deflator: state CPI; source: Del Negro
  nominal output: gsp; source: BEA
  output deflator: gsp deflator; source: BEA
Data set 2 (ASY): 50 states, 1969-95
  nominal consumption: retail sales (total) + S&L gvmt. consumption; source: Sales&Marketing, U.S. Stat. Abstract
  consumption deflator: US CPI; source: BLS
  nominal output: gsp; source: BEA
  output deflator: US gdp deflator; source: BEA
Data set 3: 50 states, 1969-95
  nominal consumption: retail sales (non durables); source: Sales&Marketing
  consumption deflator: state CPI; source: Del Negro
  nominal output: gsp; source: BEA
  output deflator: state CPI; source: Del Negro
Data set 4: 50 states, 1978-95
  nominal consumption: retail sales (non durables); source: Sales&Marketing
  consumption deflator: state CPI; source: Del Negro
  nominal output: gsp; source: BEA
  output deflator: gsp deflator; source: BEA
Note: The frequency is annual for all data. In data set 2, total retail sales for each state are multiplied by the ratio of total U.S. consumption over total U.S. retail sales for the corresponding year.
                                       consumption       output
Data set 1 (HS): 19 states, 1978-95
  growth rates                         -0.000952 (0%)    0.166 (21.1%)
  HP filter                            0.09 (0%)         0.142 (10.5%)
Data set 2 (ASY): 50 states, 1969-95
  growth rates                         0.0305 (10%)      0.263 (40%)
  HP filter                            0.154 (8%)        0.253 (32%)
Data set 3: 50 states, 1969-95
  growth rates                         -0.0665 (10%)     0.21 (24%)
  HP filter                            0.105 (12%)       0.219 (26%)
Data set 4: 50 states, 1978-95
  growth rates                         -0.0413 (8%)      0.184 (20%)
  HP filter                            0.0992 (4%)       0.157 (10%)
Note: Averages across states. The figures in parenthesis show the percentage of states for which the autocorrelation coefficient is significantly different from zero at the 5% level.
                                       consumption      output           difference   %
Data set 1 (HS): 19 states, 1978-95
  growth rates                         0.302 (0.236)    0.703 (0.186)    -0.401       94.2
  HP filter                            0.33 (0.25)      0.776 (0.181)    -0.446       94.7
Data set 2 (ASY): 50 states, 1969-95
  growth rates                         0.267 (0.245)    0.494 (0.331)    -0.227       79.8
  HP filter                            0.334 (0.265)    0.557 (0.367)    -0.223       78.7
Data set 3: 50 states, 1969-95
  growth rates                         0.314 (0.219)    0.535 (0.337)    -0.221       75.8
  HP filter                            0.295 (0.243)    0.586 (0.383)    -0.291       78.9
Data set 4: 50 states, 1978-95
  growth rates                         0.152 (0.284)    0.509 (0.279)    -0.357       83.9
  HP filter                            0.131 (0.301)    0.56 (0.302)     -0.428       86.2
Note: The first and second columns show the cross-sectional average (and standard deviation) of the correlations of consumption and output, respectively. The third column shows the difference between the two. The fourth column shows the percentage of states for which the correlation in output is larger than the correlation in consumption.
                                       consumption   output   difference   %
Data set 1 (HS): 19 states, 1978-95
  growth rates
    - with idiosyncratic component     3.32          1.67     1.64         100 (100)
    - without idiosyncratic component  2.48          1.67     0.802        89 (100)
  HP filter
    - with idiosyncratic component     2.14          1.02     1.12         100 (100)
    - without idiosyncratic component  1.67          1.02     0.648        79 (100)
Data set 2 (ASY): 50 states, 1969-95
  growth rates
    - with idiosyncratic component     3.42          2.69     0.729        86 (88)
    - without idiosyncratic component  1.84          2.69     -0.851       26 (6)
  HP filter
    - with idiosyncratic component     2.24          1.74     0.502        86 (87)
    - without idiosyncratic component  1.42          1.74     -0.316       44 (37)
Data set 3: 50 states, 1969-95
  growth rates
    - with idiosyncratic component     4.19          2.75     1.44         94 (91)
    - without idiosyncratic component  2.22          2.75     -0.523       48 (7)
  HP filter
    - with idiosyncratic component     2.79          1.79     1            92 (91)
    - without idiosyncratic component  1.86          1.79     0.0729       60 (52)
Data set 4: 50 states, 1978-95
  growth rates
    - with idiosyncratic component     4.19          2.23     1.95         94 (98)
    - without idiosyncratic component  2.55          2.23     0.319        58 (50)
  HP filter
    - with idiosyncratic component     2.76          1.38     1.37         92 (98)
    - without idiosyncratic component  1.84          1.37     0.462        66 (68)
Note: Figures are in %. The first and second columns show the cross-sectional average of the standard deviations of asymmetric shocks in consumption and output, respectively. The third column shows the difference between the two. The fourth column shows the percentage of states for which the standard deviation of asymmetric shocks in consumption is larger than the standard deviation of asymmetric shocks in output, considering all states, and considering only those states for which the difference is significantly different from zero at the 5% level (in parenthesis).
                                       consumption   output   difference   %
Data sets 1 (HS) and 2 (ASY): 19 states, 1978-95
  growth rates
    - with measurement error           3.2           1.7      1.5          100 (100)
    - without measurement error        2.31          1.52     0.79         79 (75)
  HP filter
    - with measurement error           2.05          1.03     1.01         95 (100)
    - without measurement error        1.49          0.887    0.602        89 (100)
Data sets 1 (HS) and 3: 19 states, 1978-95
  growth rates
    - with measurement error           3.19          1.75     1.44         100 (100)
    - without measurement error        2.33          1.55     0.771        79 (75)
  HP filter
    - with measurement error           2.04          1.05     0.99         95 (100)
    - without measurement error        1.51          0.885    0.623        95 (100)
Data sets 2 (ASY) and 4: 50 states, 1978-95
  growth rates
    - with measurement error           4.2           2.24     1.96         92 (100)
    - without measurement error        3.23          1.82     1.42         78 (100)
  HP filter
    - with measurement error           2.79          1.37     1.42         94 (98)
    - without measurement error        2.31          1.09     1.22         88 (100)
Data sets 3 and 4: 50 states, 1978-95
  growth rates
    - with measurement error           4.19          2.26     1.92         94 (100)
    - without measurement error        3.3           1.82     1.48         78 (100)
  HP filter
    - with measurement error           2.76          1.4      1.35         92 (98)
    - without measurement error        2.24          1.1      1.14         84 (100)
Note: Figures are in %. The first and second columns show the cross-sectional average of the estimated standard deviations of asymmetric shocks in consumption and output, respectively. The third column shows the difference between the two. The fourth column shows the percentage of states for which the estimated standard deviation of asymmetric shocks in consumption is larger than the estimated standard deviation of asymmetric shocks in output, considering all states, and considering only those states for which the difference is significantly different from zero at the 5% level (in parenthesis).
Columns: U.S. / Region / State / Common / Idios.
Data sets 1 (HS) and 2 (ASY): 19 states, 1978-95
  Output 1
    - growth rates    0.63 (26)   0.62 (47)   1 (47)      0.33 (0)   0.26 (32)
    - HP filter       0.38 (37)   0.31 (53)   0.6 (42)    0.31 (0)   0.13 (21)
  Output 2
    - growth rates    0.9 (84)
    - HP filter       0.68 (95)
  Consumption 1
    - growth rates    1.5 (63)    0.74 (26)   1.1 (11)    0.39 (0)   1.7 (63)
    - HP filter       1.1 (63)    0.5 (37)    0.62 (21)   0.36 (5)   1 (68)
  Consumption 2
    - growth rates    3 (95)
    - HP filter       1.9 (95)
Data sets 1 (HS) and 3: 19 states, 1978-95
  Output 1
    - growth rates    0.61 (32)   0.68 (58)   1 (53)      0.23 (11)  0.43 (42)
    - HP filter       0.34 (32)   0.32 (53)   0.62 (53)   0.3 (0)    0.23 (32)
  Output 3
    - growth rates    1 (89)
    - HP filter       0.75 (100)
  Consumption 1
    - growth rates    1.5 (63)    0.69 (26)   1.1 (21)    0.26 (0)   1.8 (68)
    - HP filter       1.1 (74)    0.51 (37)   0.62 (32)   0.33 (0)   1 (68)
  Consumption 3
    - growth rates    2.9 (95)
    - HP filter       1.9 (95)
Data sets 2 (ASY) and 4: 50 states, 1978-95
  Output 4
    - growth rates    0.96 (48)   0.88 (40)   0.88 (10)   0.97 (14)  0.38 (32)
    - HP filter       0.39 (32)   0.62 (44)   0.58 (8)    0.63 (16)  0.25 (34)
  Output 2
    - growth rates    1 (74)
    - HP filter       0.75 (86)
  Consumption 2
    - growth rates    1.2 (16)    1.5 (34)    2.1 (16)    2.1 (10)
    - HP filter       0.96 (28)   1.1 (40)    1.3 (10)    1.1 (10)
Data sets 3 and 4: 50 states, 1969-95
  Output 4
    - growth rates    0.96 (44)   0.92 (44)   0.89 (6)    0.92 (4)   0.49 (36)
    - HP filter       0.4 (36)    0.64 (46)   0.6 (6)     0.59 (10)  0.32 (32)
  Output 3
    - growth rates    1.3 (94)
    - HP filter       0.87 (94)
  Consumption 3
    - growth rates    1.2 (22)    1.4 (30)    2.2 (14)    1.9 (10)
    - HP filter       0.95 (30)   1.1 (38)    1.3 (12)    1.2 (4)
Note: Figures are in %. The table shows the cross-sectional average of the estimated standard deviation of asymmetric shocks due to the national (U.S.), regional (Region), and state-specific (State) factors, and to the common (Common) and idiosyncratic (Idios.) components of measurement error, as well as the percentage of states for which this is significantly different from zero at the 5% level. Specifically, the table shows the cross-sectional averages of the absolute values of the coefficients beta[sup us][sub yi], beta[sup r][sub yi], etc.
GRAPH: Figure 1: Cross correlation of output and non-durable consumption across states
GRAPH: Figure 2: Standard deviations of asymmetric shocks to output and non-durable consumption - model (1)
GRAPH: Figure 3: Standard deviations of asymmetric shocks to output and non-durable consumption - models (6) and (7)
References
Asdrubali, P., B.E. Sorensen and O. Yosha, 1996, Channels of interstate risksharing: United States 1963-1990, Quarterly Journal of Economics 111, 1081-1110.
Athanasoulis, S. and E. van Wincoop, 1997a, Growth uncertainty and risksharing, Federal Reserve Bank of New York, Staff Report 30.
Athanasoulis, S. and E. van Wincoop, 1997b, Risksharing within the United States: What have financial markets and fiscal federalism accomplished?, Federal Reserve Bank of New York, Research Paper 9808.
Atkeson, A. and T. Bayoumi, 1993, Do private capital markets insure regional risk? Evidence from the United States and Europe, Open Economies Review 4, 303-324.
Backus, D.K., P.J. Kehoe and F.E. Kydland, 1992, International real business cycles, Journal of Political Economy 100, 745-775.
Baxter, M. and R.G. King, 1999, Measuring business cycles: approximate band-pass filters for economic time series, Review of Economics and Statistics 81, 575-93.
Bayoumi, T. and M.W. Klein, 1995, A provincial view of capital mobility, NBER Working Paper # 5115.
Blanchard, O. and L. Katz, 1992, Regional Evolutions, Brookings Papers on Economic Activity, 1-73.
Carlino, G. and K. Sill, 1998, The cyclical behavior of regional per capita incomes in the postwar period, Federal Reserve Bank of Philadelphia, Working Paper # 98-11.
Clark, T.E., 1998, Employment fluctuations in U.S. regions and industries: the roles of national, region-specific, and industry-specific shocks, Journal of Labor Economics 16, 202-29.
Clark, T.E. and K. Shin, 1999, The sources of fluctuations within and across countries, in: G.D. Hess and E. van Wincoop, eds., Intranational macroeconomics, Cambridge University Press, Cambridge.
Costello, D.M., 1993, A cross-country, cross-industry comparison of productivity growth, Journal of Political Economy 101, 207-222.
Coval, J.D. and T.J. Moskowitz, 1997, Home bias at home: local equity preferences in domestic portfolios, mimeo.
Crucini, M.J., 1998, On international and national dimension of risk sharing, Review of Economics and Statistics 81, 73-84.
Crucini, M.J. and G.D. Hess, 1999, International and intranational risk sharing, in: G.D. Hess and E. van Wincoop, eds., Intranational macroeconomics, Cambridge University Press, Cambridge.
Del Negro, M., 1998a, Aggregate risk sharing across U.S. states and across European countries, mimeo.
Del Negro, M., 1998b, Aggregate risk and risk sharing across U.S. states and across European countries, unpublished Ph.D. dissertation, Yale University.
Eichengreen, B., 1990, One money for Europe? Lessons from the U.S. currency union, Economic Policy 0, 117-187.
Forni, M. and L. Reichlin, 1998, Let's get real: a factor analytical approach to disaggregated business cycle dynamics, Review of Economic Studies 65, 453-473.
Frankel, J.A. and A.K. Rose, 1997, Is EMU more justifiable ex post than ex ante?, European Economic Review 41, 753-60.
Friedenberg, H.L. and R.M. Beemiller, 1997, Comprehensive Revision of Gross State Product by Industry, 1977-94, Survey of Current Business, June.
Gelman, A., J.B. Carlin, H.S. Stern, and D.B. Rubin, 1995, Bayesian data analysis, Chapman & Hall, London.
Ghosh, A.R. and H.C. Wolf, 1997, Geographical and sectoral shocks in the U.S. business cycle, NBER Working Paper # 6180.
Gregory, A.W., A.C. Head and J. Raynauld, 1997, Measuring World Business Cycles, International Economic Review 38, 677-701.
Hess, G. D. and K. Shin, 1997, International and intranational business cycles, Oxford Review of Economic Policy 13, 93-109.
Hess, G. D. and K. Shin, 1998, Intranational business cycles in the United States, Journal of International Economics 44, 289-314.
Hess, G. D. and K. Shin, 1999, Risk sharing within and across regions and industries, Journal of Monetary Economics, forthcoming.
Huberman, G., 1999, Home bias in equity markets: international and intranational evidence, in: G.D. Hess and E. van Wincoop, eds., Intranational macroeconomics, Cambridge University Press, Cambridge.
Kollmann, R., 1995, The correlation of productivity growth across regions and industries in the United States, Economic Letters 47, 437-443.
Kose, M.A., C. Otrok and C.H. Whiteman, 1999, International business cycles: world, region, and country-specific factors, University of Iowa, mimeo.
Krugman, P., 1991, Geography and trade, Leuven University Press, Leuven.
Lehmann, B. and D. Modest, 1985, The empirical foundations of arbitrage pricing theory I: the empirical tests, NBER Working Paper # 1725.
Melitz, J. and F. Zumer, 1999, Interregional and international risk sharing and lessons for EMU, Carnegie-Rochester Conference Series on Public Policy, 51, 149-88.
Norrbin, S.C. and D.E. Schlagenhauf, 1988, An inquiry into the source of macroeconomic fluctuations, Journal of Monetary Economics 22, 43-70.
Obstfeld, M., 1994, Are Industrial-country consumption risks globally diversified, in: L. Leiderman and A. Razin, eds., Capital mobility: the impact on consumption, investment, and growth, Cambridge University Press, Cambridge.
Sorensen, B.E. and O. Yosha, 1997, Income and consumption smoothing among U.S. states: regions or clubs?, Centre for Economic Policy Research, Discussion Paper # 1670.
Sorensen, B.E. and O. Yosha, 1998, International risk sharing and European monetary reunification, Journal of International Economics, 45, 211-38.
Stockman, A.C., 1988, Sectoral and national aggregate disturbances to industrial output in seven European countries, Journal of Monetary Economics 21, 387-409.
Stockman, A.C. and L.L. Tesar, 1995, Tastes and technology in a two-country model of the business cycle: explaining international co-movements, American Economic Review 85, 168-85.
van Wincoop, E., 1994, Welfare gains from international risksharing, Journal of Monetary Economics 34, 175-200.
van Wincoop, E., 1995, Regional risksharing, European Economic Review 39, 1545-1567.
Von Hagen, J. and G. Hammond, 1995, Regional insurance against asymmetric shocks. An empirical study for the European Community, Centre for Economic Policy Research, Discussion Paper # 1170.
Wynne, M.A. and J. Koo, 1999, Business cycles under monetary union: A comparison of the EU and US, Federal Reserve Bank of Dallas, mimeo.
By Marco Del Negro
Title: | Analysis of Cross Category Dependence in Market Basket Selection. |
Subject(s): | |
Source: | |
Author(s): | |
Abstract: | Market basket choice is a decision process in which a consumer selects items from a number of product categories on the same shopping trip. The key feature of market basket choice is the interdependence in demand relationships across the items in the final basket. This research develops a new approach to the specification of market basket models that allows a choice model for a basket of goods to be constructed using a set of "local" conditional choice models corresponding to each item in the basket. The approach yields a parsimonious market basket model that allows for any type of demand relationship across product categories (complementarity, independence, or substitution) and can be estimated using simple modifications of standard multinomial logit software. We analyze the choice of four grocery store categories that exhibit common cross-category brand names for both national brands and private labels. Results indicate that cross-category price elasticities are small. We argue that store traffic patterns may be more important than consumer-level demand interdependence in forecasting market basket choice. [ABSTRACT FROM AUTHOR] |
AN: | 3701300 |
ISSN: | 0022-4359 |
Database: | Business Source Premier |
The advent of retail scanner data has created a wealth of information about consumer behavior for both retailers and manufacturers. This information revolution has allowed researchers to study the pattern of brand competition within specific product categories (e.g., Cooper, 1988; Grover and Srinivasan, 1992; Russell and Kamakura, 1994). However, researchers and managers are increasingly interested in understanding the pattern of brand competition across product categories (Blattberg, 1989; Blattberg and Neslin, 1990).
Our research is based upon a growing body of work known as market basket analysis. Market basket analysis is a generic term for methodologies that study the composition of the basket (or bundle) of products purchased by a household during a single shopping occasion. Current commercial applications emphasize electronic couponing, the tailoring of coupon face value and distribution timing using information about the household's basket of purchases (Catalina Marketing, 1997); and affinity analysis, the design of store layout according to the coincidence of pairs of items in a market basket (Brand and Gerristen, 1998). Both types of applications are based upon the belief that sales in different product categories in the market basket are correlated. The patterns in these correlations are then used to make marketing strategy recommendations.
The academic literature in market basket analysis is small, but growing. This literature attempts to go beyond the correlational approaches found in the marketing research industry by identifying the sources of cross-category dependence in market basket selection. Market basket analysis is regarded as a pick-any choice problem. When consumers enter a store, they are confronted by a large number of possible product categories that may be purchased. Consumers may then select all, none, or any subset of the available categories. The key questions facing choice researchers in marketing are whether the multiple decisions in a pick-any choice task are related and how marketing managers can use cross-category linkages to develop marketing strategies (Russell, Ratneshwar, Shocker et al., 1999).
Two different explanations for cross-category linkage have been suggested: store choice and global utility. Store choice models argue that sales in different categories are related because the mix of consumers in the store changes from week to week due to marketing activity. Because product preferences are correlated across categories (Russell and Kamakura, 1997), the changing mix of consumers over time creates cross-category correlation in store-level sales data.
For example, Bodapati and Srinivasan (1999) develop a model in which consumers use knowledge about prices and feature advertising to select a retailer who delivers the least expensive basket of goods. Once the retailer is selected, the basket is formed by a series of conditionally independent consumer decision models predicting category incidence, brand choice and purchase quantity. Bell and Lattin (1998) develop a model that relates store choice to the pricing strategy of a store across multiple product categories. The decision to make a major shopping trip (i.e., allocate a larger dollar expenditure to the basket) alters store choice and also increases the number of items in the market basket. Models in this research stream assume that cross-category dependence is found at the market level (due to store traffic effects), but is not found at the consumer level (because the household does not view purchases in different categories as related).
In contrast, global utility models argue that cross-category dependence is present within the choice process of each consumer. In these models, cross-category choice correlations exist because consumer preference for an item in one category is contingent on the consumption of items in other product categories. Harlam and Lodish (1995) link choices across potential complements within the same product category (different flavors of powdered soft drinks) by making utility for the current choice dependent upon the attributes of previously selected items (the flavors of previously selected drinks). Erdem (1998) demonstrates that the utilities for products in different product categories that share the same brand name jointly covary with product experience. Manchanda, Ansari and Gupta (1999) use the multivariate probit model to show that true consumption complements (detergent and fabric softener) exhibit choice dependence within a shopping trip. In effect, global utility models argue that cross-category choice dependence exists even when store traffic effects are unimportant.
Our work extends the market basket literature by proposing a new method of studying the strength of cross-category choice dependence within each consumer's purchase history. Drawing upon modeling techniques from the spatial statistics literature, we show how a researcher can develop a global utility model for the entire basket of purchases by constructing linked choice models for each category individually. This approach has two important advantages. First, we obtain a parsimonious global utility model that allows for any type of demand relationship across product categories (complementarity, independence, or substitution) within the choice process of each consumer. That is, the within-consumer pattern of demand effects is unrestricted. Second, because the building blocks of the approach have a logistic form, we demonstrate that the resulting market basket choice probabilities follow a multivariate logistic distribution. As we show subsequently, this model form allows response parameters to be estimated using simple modifications of standard multinomial logit software.
We begin by discussing a general method of building a complex global choice model for the market basket from relatively simple single category choice models. By specializing this general approach, we derive the multivariate logistic market basket model and explore its implications for consumer choice behavior. We then apply the approach to the choice of four grocery store categories that exhibit common cross-category brand names for both national brands and private labels. Although we find evidence of cross-category choice dependence, the magnitude of the cross-category price effects is modest. Substantively, our results suggest that store traffic effects may be more important than global utility effects in modeling market basket choice.
The market basket model developed in our research is built using a flexible approach to model specification developed in the spatial statistics literature. This approach, which we call conditional choice specification, essentially allows the researcher to specify a global model (the choice of entire basket of items) by specifying a series of local models (a choice model for each item in the basket). Intuitively, this method assumes that choices are made in a certain order, but does not require the researcher to actually know the order in which choices are made in building up the basket.
Multiple Category Choice
Suppose that the researcher is attempting to model choice activity across four product categories: A, B, C, and D. Consumers make choices in each of the four categories in some sequence that is not observed. In a statistical sense, we can think of each basket as consisting of four random variables (corresponding to the buy and not buy decisions of the consumer). Clearly, the choice process implies that the consumer will select one of 2[sup 4] = 16 possible market bundles. This is a pick-any choice task: consumers may select all, none, or any subset of the four available categories.
Formally, the choice process for the entire basket can be expressed in terms of a four dimensional multivariate distribution p(A,B,C,D) that defines the relative likelihood of each of the 16 possible market baskets. At this point, two different strategies may be advanced for constructing a choice model. First, we could attempt to directly specify p(A,B,C,D) based upon some prior knowledge of how category features interact in the consumer's utility function. This is equivalent to a direct specification of a global utility function over a bundle of items [see, e.g., Farquhar and Rao (1976) and McAlister (1979)]. Second, we could assume that the choices that lead to the construction of a basket are made in a known order [Kamakura et al. (1991), Harlam and Lodish (1995)]. Using this sequential information, we could then develop a model in which the consumer evaluates the utility of the current item relative to a cumulative variable representing the composite utility of the basket of previous choices. For example, if D is considered first, C second, B third, and A fourth, then we would write:
(1) p(A,B,C,D) = p(A|B,C,D) p(B|C,D) p(C|D) p(D)
where the notation p(x|y) denotes the probability of x given y.
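Equation (1) is simply the chain rule of probability and holds for any joint distribution, whatever its dependence structure. A minimal Python sketch (not part of the original analysis; the joint distribution here is randomly generated and purely illustrative) confirms the factorization numerically for four binary category indicators:

```python
import itertools
import random

random.seed(0)
# Hypothetical joint distribution over four binary category indicators (A, B, C, D).
outcomes = list(itertools.product([0, 1], repeat=4))
weights = [random.random() for _ in outcomes]
total = sum(weights)
p = {o: w / total for o, w in zip(outcomes, weights)}

def marginal(fixed):
    """Sum p over all outcomes consistent with the fixed positions {index: value}."""
    return sum(pr for o, pr in p.items() if all(o[i] == v for i, v in fixed.items()))

# Check equation (1): p(A,B,C,D) = p(A|B,C,D) p(B|C,D) p(C|D) p(D)
a, b, c, d = 1, 0, 1, 1
lhs = p[(a, b, c, d)]
rhs = (marginal({0: a, 1: b, 2: c, 3: d}) / marginal({1: b, 2: c, 3: d})  # p(A|B,C,D)
       * marginal({1: b, 2: c, 3: d}) / marginal({2: c, 3: d})            # p(B|C,D)
       * marginal({2: c, 3: d}) / marginal({3: d})                        # p(C|D)
       * marginal({3: d}))                                                # p(D)
assert abs(lhs - rhs) < 1e-12
```

The identity holds for every one of the 16 outcomes; the difficulty noted in the text is not the mathematics but the fact that the true choice order is unobserved.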
Both approaches are difficult to implement in a retail market basket setting. Directly specifying the utility function for the market basket is problematic because it requires both a detailed understanding of cross-category demand relationships (complementarity, independence, or substitution) and an explicit enumeration of all possible market baskets. In contrast, building the utility model sequentially is attractive conceptually, but makes the unrealistic assumption that the retailer can readily observe the order in which choices are made.
Conditional Choice Specification
The conditional choice specification approach proposed here avoids both the direct specification of the joint distribution p(A,B,C,D) and the assumption of a particular choice order. The logic of the approach is best explained by considering the following scenario. Suppose that we follow the consumer around the store and observe each choice decision. Towards the end of the shopping trip, we find that the consumer has made choices in three categories (A, B, and C) and is now considering whether or not to buy in the last category (D). The conditional distribution approach assumes that we can specify p(D|A,B,C), the probability of buying in this last category, given the known outcomes of the previous choice decisions. In consumer behavior terms, specifying this conditional distribution is equivalent to specifying the utility of category D given the attributes of category D and the context created by the earlier choices.
However, recall that the researcher does not usually know the identity of the last category. For this reason, we replace assumptions about sequence with assumptions about the following set of conditional distributions: p(A|B,C,D), p(B|A,C,D), p(C|A,B,D), and p(D|A,B,C). These so-called full conditional distributions correspond to placing each category (in turn) as the last choice decision and then specifying the conditional probability of selecting this last category. Although the true decision sequence is not known, the researcher is assumed to be able to develop a set of choice models that collectively describe the last decision in any possible decision sequence.
Remarkably, by using only information about these full conditional distributions, the researcher can infer all properties of p(A,B,C,D), the probability distribution that describes the relative likelihood of each possible bundle. Technically, this can be done because there exists a one-to-one correspondence between full conditional and joint distributions: for any probability model (regardless of structure), the complete set of full conditional distributions uniquely determines the joint distribution p(A,B,C,D) provided that all full conditional distributions are mutually consistent (Besag, 1974; Cressie, 1993). Details on this theoretical relationship are discussed in the next section. Intuitively, the procedure works because the full conditional distributions completely define the dependencies in choice decisions across the entire choice bundle {A,B,C,D}.
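The one-to-one correspondence between full conditionals and the joint distribution can be illustrated numerically. The sketch below (illustrative parameters, N = 3 categories; not drawn from the article's data) computes the full conditional distributions from a known joint and then reconstructs that joint from the conditionals alone via Brook's lemma, the device underlying Besag's (1974) result:

```python
import itertools
import math

# Hypothetical multivariate-logistic joint over N = 3 binary choice variables,
# Mu(x) = sum_i beta_i x_i + sum_{i<j} theta_ij x_i x_j (illustrative parameters).
beta = [0.2, -0.5, 0.1]
theta = {(0, 1): 0.8, (0, 2): -0.4, (1, 2): 0.3}

def mu(x):
    return (sum(b * xi for b, xi in zip(beta, x))
            + sum(t * x[i] * x[j] for (i, j), t in theta.items()))

states = list(itertools.product([0, 1], repeat=3))
Z = sum(math.exp(mu(s)) for s in states)
joint = {s: math.exp(mu(s)) / Z for s in states}

def full_conditional(i, x):
    """P(x_i = 1 | all other coordinates of x), computed from the joint."""
    x1 = tuple(1 if k == i else v for k, v in enumerate(x))
    x0 = tuple(0 if k == i else v for k, v in enumerate(x))
    return joint[x1] / (joint[x1] + joint[x0])

def brook_ratio(x):
    """p(x)/p(0), built from full conditionals only by flipping one coordinate at a time."""
    cur, ratio = [0, 0, 0], 1.0
    for i in range(3):
        if x[i] == 1:
            p1 = full_conditional(i, tuple(cur))
            ratio *= p1 / (1 - p1)
            cur[i] = 1
    return ratio

# Normalizing the reconstructed ratios recovers the joint distribution exactly.
recon = {s: brook_ratio(s) for s in states}
norm = sum(recon.values())
assert all(abs(recon[s] / norm - joint[s]) < 1e-12 for s in states)
```

Only the conditionals are used in the reconstruction, which is why specifying the set of full conditional models is enough to pin down the entire basket distribution.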
The conditional choice approach is of great practical importance because it is much easier for a researcher to use marketing theory to specify a choice process one decision at a time (the full conditional distributions) than to specify a choice process for the entire market basket simultaneously (the joint distribution). By focusing attention on only one choice decision, the researcher can draw upon the marketing literature on single category choice (e.g., Kamakura and Russell, 1989; Gupta, 1988; Jain, Vilcassim, and Chintagunta, 1994) during model specification. Nevertheless, the result of the conditional choice approach is a complete market basket model that implies dependence of choice on decision sequence and allows for flexibility in demand relationships among categories (complementarity, independence, or substitution).
In this section, we construct a market basket model by assuming that the conditional probability of choice in one category, given the actual choices in all other categories, can be expressed in the form of the logit model. We demonstrate that the implied choice model for all market baskets can be expressed as a multivariate logistic distribution (Cox, 1972). This model permits the researcher considerable flexibility in assessing the degree of correlation among choices in different categories. In particular, it allows the researcher to predict how marketing activity in one category impacts choice in other product categories.
Choice Problem
Let k denote a consumer and t denote a time point. Assume that the consumer has i = 1,2, ... N categories available for purchase. Then, we define a market basket as the vector of category choices
B(k,t) = {C(1,k,t),...,C(N,k,t)}
where C(i,k,t) = 1 if consumer k buys category i at time t (and equals 0 otherwise). Because each C(i,k,t) can take on only two values, our notation implies that there are 2[sup N] possible baskets that could be selected. (One of these baskets--the null basket--corresponds to nonpurchase across all categories.) Accordingly, the market basket model developed subsequently assigns a choice probability to each of these 2[sup N] baskets.
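The basket space is straightforward to enumerate; a two-line sketch makes the count concrete (N = 4 matching the empirical application later in the article):

```python
from itertools import product

N = 4  # e.g., the four paper goods categories analyzed later
baskets = list(product([0, 1], repeat=N))  # every possible vector of category choices
assert len(baskets) == 2 ** N              # 16 baskets, including the null basket
assert (0,) * N in baskets                 # the null basket: no purchase in any category
```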
Conditional Choice Models
The conditional choice methodology requires the researcher to specify the probability that each category will be chosen, conditional upon the known choices in all other categories. Here, we assume that the conditional utility of consumer k for product category i at time t is given by
(2) U(i,k,t) = beta[sub i] + HH[sub ikt] + MIX[sub ikt] + SIGMA[sub j is not equal to i] theta[sub ijk] C(j,k,t) + epsilon(i,k,t)
where HH[sub ikt] denotes variables defining household characteristics, MIX[sub ikt] denotes variables defining the marketing mix, and epsilon(i,k,t) is a random error with mean zero. The term SIGMA[sub j is not equal to i] theta[sub ijk]C(j,k,t) links the choice of the current category i to the actual choice decisions in all other product categories in the basket.
Note that theta[sub ijk] > 0 implies a positive association between product categories, while theta[sub ijk] < 0 implies a negative association. For reasons which will become clear subsequently, logical consistency requires that the coefficients on the observed choice variables be symmetric (theta[sub ijk] = theta[sub jik]). Because theta[sub ijk] depends upon the household k, we allow the magnitude of the cross effects to vary across households.
Without loss of generality, we assume that the probability of buying category i, conditional upon the choice outcomes in all other categories, equals the probability that U(i,k,t) > 0. Further, by assuming that the random error has an extreme value distribution, the conditional probability of selecting category i can be expressed as the logit model
(3) Pr(C(i,k,t) = 1|C(j,k,t) for j is not equal to i) = [1 + exp{-Z(i,k,t)}][sup -1]
where
(4) Z(i,k,t) = beta[sub i] + HH[sub ikt] + MIX[sub ikt] + SIGMA[sub j is not equal to i] theta[sub ijk]C(j,k,t)
is the deterministic portion of equation (2). Intuitively, one assumes in this model that the consumer's choice of the final category in the basket (category i) is affected by the bundle of categories already selected. In this way, the probability of choice in one category is dependent upon the context created by previous choices.
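The conditional logit of Equations (3) and (4) is easy to sketch in code. The following is a minimal illustration (all parameter values are hypothetical, not estimates from the article) showing how a positive cross effect raises the conditional choice probability when another category is already in the basket:

```python
import math

def conditional_choice_prob(beta_i, hh, mix, theta_row, other_choices):
    """Equations (3)-(4): probability of buying category i, given the observed
    choices in all other categories. theta_row[j] is the cross effect of
    category j on category i (illustrative values only)."""
    z = beta_i + hh + mix + sum(theta_row[j] * c for j, c in other_choices.items())
    return 1.0 / (1.0 + math.exp(-z))

# With theta > 0, having category B in the basket raises the probability of buying i:
p_without = conditional_choice_prob(-1.0, 0.3, -0.2, {"B": 0.8}, {"B": 0})
p_with = conditional_choice_prob(-1.0, 0.3, -0.2, {"B": 0.8}, {"B": 1})
assert p_with > p_without
```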
Market Basket Model
Although the full conditional models in Equations (3) and (4) implicitly link all categories in a common framework, the form of the market basket distribution is not evident by simple inspection. For this task, we turn to Besag's (1974) remarkable characterization theorem, which provides a simple mathematical way of deriving a joint distribution given a set of full conditionals (see Appendix A). In this application, the set of full conditional distributions is given by Equations (3) and (4), whereas the joint distribution refers to the distribution of the market baskets B(k,t) = {C(1,k,t),..., C(N,k,t)}.
By applying Besag's (1974) theorem, we obtain the following key result. Suppose that basket b has contents {X(1,b), X(2,b),...,X(N,b)}. (In this notation, X(i,b) is a dummy (0-1) variable that takes on the value one if category i is included in basket b.) Then, given Equations (3) and (4) and the assumption that cross effect coefficients are symmetric (theta[sub ijk] = theta[sub jik]), the probability of selecting basket b is given by
(5) Pr(B(k,t) = b) = exp{Mu(b,k,t)}/SIGMA[sub b, sup *] exp{Mu(b[sup *],k,t)}
where
(6) Mu(b,k,t) = SIGMA[sub i]beta[sub i]X(i,b) + SIGMA[sub i]HH[sub ikt]X(i,b) + SIGMA[sub i]MIX[sub ikt]X(i,b) + SIGMA[sub i<j] theta[sub ijk] X(i,b)X(j,b)
is the imputed utility of basket b. Notice that this model predicts the probability of selecting each of the 2[sup N] baskets using only the set of parameters needed to define the conditional logit models.
An appreciation of the structure of the model can be obtained by considering a simple setting in which a given consumer k chooses from only two categories (X1 and X2). In Table 1, we show the terms for Mu(b,k,t) corresponding to the four possible market baskets. Notice that the structure of Mu(b,k,t) is identical to the terms of a log linear model containing main effects and two-way interactions. In particular, it is easy to show that the cross effect term obeys the relationship
(7) exp{theta[sub 12]} = [Pr(X1 = 1,X2 = 1)/Pr(X1 = 0,X2 = 1)]/ [Pr(X1 = 1,X2 = 0)/Pr(X1 = 0,X2 = 0)]
where the right hand side is the so-called odds ratio measuring association in the table. Because the odds ratio is symmetrical in the category indices (1 and 2), the cross effect coefficient must be symmetrical as well. Intuitively, this explains why symmetry in the theta[sub ijk] is necessary to derive the basket model. In general, theta[sub ijk] has the same sign (positive, negative, or zero) as the correlation between C(i,k,t) and C(j,k,t).
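This odds-ratio relationship can be verified numerically. In the sketch below (illustrative parameters, two categories), the cross effect enters on the log scale: the exponential of theta[sub 12] exactly equals the odds ratio of the implied 2x2 basket table.

```python
import math
from itertools import product

# Two-category multivariate logistic basket model (illustrative parameters).
beta1, beta2, theta12 = 0.4, -0.3, 0.7

def mu(x1, x2):
    # Mu(b) for basket (x1, x2): main effects plus the two-way interaction.
    return beta1 * x1 + beta2 * x2 + theta12 * x1 * x2

Z = sum(math.exp(mu(a, b)) for a, b in product([0, 1], repeat=2))
p = {(a, b): math.exp(mu(a, b)) / Z for a, b in product([0, 1], repeat=2)}

# Equation (7): exp(theta12) equals the odds ratio of the basket probabilities.
odds_ratio = (p[(1, 1)] / p[(0, 1)]) / (p[(1, 0)] / p[(0, 0)])
assert abs(math.exp(theta12) - odds_ratio) < 1e-12
```

Because the odds ratio is invariant to swapping the two category labels, the symmetry requirement theta[sub 12] = theta[sub 21] drops out automatically, as the text notes.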
Model Interpretation
This model can be interpreted in two different (but logically equivalent) ways. Viewed from the standpoint of the 2[sup N] market baskets b, Equations (5) and (6) can be viewed as a logit choice model defined over a set of alternatives (baskets) with a particular utility specification Mu(b,k,t). This interpretation facilitates model calibration since standard logit software can be easily adapted to obtain maximum likelihood estimates of the parameters in Equation (6). It is important to understand that the set of variables {X(i,b) for i = 1, 2,...N} simply describes the contents of each particular basket. For example, the null basket (no purchases in any category) has the following set: X(i,b) = 0 for all categories i. In contrast, a basket consisting only of category 1 has the following set: X(1,b) = 1 and X(i,b) = 0 for all categories i different from category 1. Clearly, the X(i,b) are known to the researcher because they define the types of baskets available to the consumer. For this reason, once the market basket model in Equations (5) and (6) is calibrated, it may be used for forecasting purposes.
The model can also be interpreted as a multivariate distribution defined over the vector of binary random variables B(k,t) = {C(1,k,t),...C(N,k,t)}. Equations (5) and (6) are in the form of the multivariate logistic distribution (Cox, 1972), a general distribution for correlated binary random variables. For this reason, it is equally valid to state that the joint probability Pr(C(1,k,t) = X(1,b), C(2,k,t) = X(2,b),...,C(N,k,t) = X(N,b)} is given by Equation (6). In particular, the conditional probabilities Pr(C(i,k,t) = 1 | C(j,k,t), j is not equal to i) implied by Equation (6) are identical to the conditional logit expressions assumed in Equations (3) and (4). As we show in our empirical work, this view of the market basket model facilitates the computation of cross-category price elasticities.
This discussion points up an important fact about the conditional choice specification approach. The market basket model in Equations (5) and (6) is not derived from a standard random utility maximization argument. In particular, we do not begin with the utilities in Equation (6), add a random error, and then derive choice probabilities for the market baskets. Instead, we begin with the conditional logit models in Equations (3) and (4) and then derive the implied market basket model [Equations (5) and (6)] using Besag's (1974) theorem. According to this theorem, the only market basket model consistent with the assumed conditional logit models is the multivariate logistic distribution of Cox (1972). Accordingly, the logit form of the basket model follows from our conditional choice distribution assumptions. Put another way, once a researcher accepts the form of the conditional choice models in (3) and (4), the multivariate logistic form of the market basket model follows immediately as a logical consequence.
Summary
At this point, the key features of the multivariate logistic basket model should be clear. By introducing cross effects into the conditional logit models of Equations (3) and (4), we are able to build a parsimonious basket selection model that accommodates a general pattern of dependence across product categories. This cross-category dependence is a direct consequence of the fact that the model implicitly defines a general utility function over all 2[sup N] possible market baskets.
In this section, we apply the multivariate logistic market basket model to the analysis of basket choice involving four paper goods categories. We show that the model predicts choice better than a simpler model that assumes independence in choice across the categories. This analysis shows that marketing mix actions that increase choice probability in one paper goods category impact choice probabilities in the remaining categories. However, the magnitude of these effects--measured from the perspective of cross-price elasticities--is small.
The data are taken from a purchase panel of 170 households in the Toronto, Canada metropolitan area over a 2-year period. Purchases are recorded for four paper goods categories: paper towels, toilet paper, facial tissue, and paper napkins. These data were selected for analysis because the four categories contain national brand names and private labels that cut across product category boundaries (see Russell and Kamakura, 1997 for details). Moreover, paper goods products are typically bulky and are usually located in the same area of a grocery store. For these reasons, there is reason to suspect a priori that choice across these categories will be correlated.
The analysis was conducted by splitting the data into three consecutive periods. The first 30 weeks of the data were used to create household-specific category loyalty variables (defined subsequently). The remainder of the data was split into two sets: a model calibration period (2,578 baskets over 41 weeks) and a holdout period (822 baskets over a subsequent 30 weeks). Price levels and inter-purchase times are very similar across the two time periods (Table 2). However, average loyalty values differ across calibration and holdout time periods because 61 households have no records during the weeks covered by the holdout data.
The market basket distribution shows a highly skewed pattern. These data, listed in Table 3, identify each basket in terms of its contents: P = paper towels, T = toilet paper, F = facial tissue, and N = paper napkins. As might be expected, smaller baskets occur much more frequently than larger baskets. Note that Table 3 does not contain a frequency count for the null basket (i.e., no purchase in any of the four categories). This is due to the fact that the dataset was constructed conditional upon the household buying at least one of the four paper goods categories on a given shopping trip. In analyzing these data, we make an adjustment for the fact that the null basket is never observed.
Before the formal analysis is discussed, it is interesting to examine the conclusions that would be drawn from an affinity analysis (Brand and Gerristen, 1998). Affinity analysis compares the observed market basket distribution to a hypothetical distribution that would be observed if the presence of a category in a basket is statistically independent of the presence of any other category. By identifying categories that co-occur more (or less) frequently than expected, retail policy recommendations (such as store layout) are developed. Intuitively, affinity analysis can be regarded as a method of clustering categories into groups that are purchased on the same shopping occasion.
To illustrate affinity analysis for our data, we fit a main effects log linear model to the Table 3 calibration basket counts and obtained a forecasted distribution. Because main effects log linear models assume independence across the factors in a contingency table, the forecasts of this model are equivalent to an affinity analysis benchmark. This benchmark distribution, labeled "Independence" in Figure 1, is significantly different from the observed basket distribution as judged by a chi-squared test (p < .0001). Notice that baskets of size two appear less often than expected, whereas baskets of size one generally occur more often than expected.
Accordingly, affinity analysis would argue that the four paper goods categories act as substitutes: the presence of one category in the basket decreases the likelihood that another category will be in the basket. As we show subsequently, these conclusions are not correct. The key problem is that affinity analysis is subject to biases because it ignores both consumer heterogeneity and marketing mix effects.
Model Specification
To specialize the market basket model to the paper goods dataset, we define the household characteristic (HH[sub ikt]) and marketing mix (MIX[sub ikt]) terms of Equation (6). We assume that characteristics of household k can be expressed as
(9) HH[sub ikt] = delta[sub 1] log[TIME[sub ikt] + 1] + delta[sub 2] LOYAL[sub ik]
where TIME[sub ikt] is the time in weeks since the household's last category purchase and LOYAL[sub ik] is a loyalty variable that adjusts for the household's long-run propensity to buy the category.(n1) We define LOYAL[sub ik] = log([n(i,k) + .5]/[n(k) + 1]) where n(i,k) is the number of product category i purchases across the household's n(k) purchase events in the initial 30 weeks of the dataset. Because TIME[sub ikt] is a surrogate for category inventory and LOYAL[sub ik] is a measure of interest in the product category, we expect both delta[sub 1] and delta[sub 2] to be positive.
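The loyalty variable is a one-line computation; the sketch below uses hypothetical purchase counts to show why the smoothing constants (.5 and 1) are included:

```python
import math

def loyalty(n_category, n_trips):
    """LOYAL_ik = log((n(i,k) + 0.5) / (n(k) + 1)), computed over the
    30-week initialization period. The smoothing constants keep the log
    finite even for a household that never bought the category."""
    return math.log((n_category + 0.5) / (n_trips + 1))

# Hypothetical households: 6 category purchases in 10 trips vs. none in 10 trips.
assert loyalty(6, 10) > loyalty(0, 10)
assert math.isfinite(loyalty(0, 10))  # no log(0) for zero-purchase households
```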
In this analysis, the marketing mix of the basket model is defined as
(10) MIX[sub ikt] = gamma[sub 1] log[PRICE[sub ikt]]
where PRICE[sub ikt] is a price index for category i at time t. (The dependence on household k is due to the fact that different households face different marketing environments.) The index is a weighted average price taken across all stock keeping units (SKU's) in the category. Weights are long-run volume shares for the SKU's for the entire purchase panel over the first 30 weeks of the dataset. Price is defined in terms of dollars per equivalent unit. Category level promotional variables (feature and display) are excluded from the analysis due to high correlation with price. Consequently, the price coefficient in the model captures both regular price and promotional effects. We expect gamma[sub 1] to be negative.
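As an illustration of the index construction (the SKU prices and volumes below are hypothetical, not from the panel):

```python
def price_index(sku_prices, sku_volumes):
    """Volume-share-weighted average category price per equivalent unit,
    with weights taken from the initialization period (hypothetical data)."""
    total = sum(sku_volumes)
    return sum(p * v / total for p, v in zip(sku_prices, sku_volumes))

# Two SKUs at $1.00 and $2.00 per equivalent unit, with volume shares 3:1.
assert abs(price_index([1.00, 2.00], [3.0, 1.0]) - 1.25) < 1e-12
```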
A key component of the model is the specification of the theta[sub ijk] cross-effect terms. We model these effects as
(11) theta[sub ijk] = delta[sub ij] + phi SIZE[sub k]
where the basket size loyalty variable SIZE[sub k] is set to the mean number of categories per trip chosen by household k during the initial 30-week period of the data. To force the cross effects to be symmetrical within each household, we impose the constraint that the delta[sub ij] be symmetrical with respect to categories i and j. Equation (11) accounts for the fact that shopping style will have an impact on the magnitude of the cross effects. If a household shops infrequently, it is more likely to buy a large basket of goods on each trip. Basket size loyalty measures this type of behavior. Clearly, households that have larger baskets on a typical shopping trip should exhibit larger cross effects in the market basket model. Accordingly, we expect phi to be positive.
Model Calibration
Because these data were constructed in such a way as to exclude the null basket from consideration, it is necessary to slightly alter the form of the basket model. Recall that Equations (5) and (6) assume that all 2[sup N] possible baskets are available for selection. However, to analyze these data, we need the form of the market basket distribution, conditional upon the knowledge that purchases are made in at least one category. That is, we need to constrain the basket choice model to the 2[sup N] -- 1 alternative baskets that can be observed in these data.
Given the form of Equation (5), this constrained basket choice model is extremely easy to infer. Let 0 denote the null basket. Then, using (5), the probability that a basket contains at least one product category is
(12) Pr(B(k,t) is not equal to 0) = SIGMA[sub b[sup *] is not equal to 0] exp{Mu(b[sup *],k,t)}/SIGMA[sub b, sup *] exp{Mu(b[sup *],k,t)}
where the numerator runs over all baskets that are not empty. Accordingly, by taking the ratio of Equation (5) to Equation (12), we find that the probability of selecting basket b, given that b is not the null basket, is
(13) Pr(B(k,t) = b | B(k,t) is not equal to 0) = exp{Mu(b,k,t)}/SIGMA[sub b[sup *] is not equal to 0] exp{Mu(b[sup *],k,t)}
where the denominator runs over all baskets that are not empty. In essence, all we need do is retain the form of the original basket model, but exclude the null basket from the possible alternatives. Notice that the parameters in (13) are the same as the parameters in the original model. Hence, we will obtain consistent estimates of all parameters in the full basket model, despite the fact that we do not observe the selection of the null basket.
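The truncation argument is easy to verify numerically. In the sketch below (illustrative basket utilities over two categories), dropping the null basket and renormalizing yields exactly the conditional probabilities of Equation (13), with no change to the underlying parameters:

```python
import math

# Illustrative basket utilities Mu(b) over N = 2 categories; (0, 0) is the null basket.
mu = {(0, 0): 0.0, (1, 0): 0.3, (0, 1): -0.1, (1, 1): 0.5}

full = {b: math.exp(m) for b, m in mu.items()}
Z_full = sum(full.values())
p_full = {b: v / Z_full for b, v in full.items()}  # Equation (5)

# Equation (13): exclude the null basket from the denominator and renormalize.
nonnull = {b: v for b, v in full.items() if b != (0, 0)}
Z_cond = sum(nonnull.values())
p_cond = {b: v / Z_cond for b, v in nonnull.items()}

# Conditioning on "at least one purchase" is just renormalization:
for b in nonnull:
    assert abs(p_cond[b] - p_full[b] / (1 - p_full[(0, 0)])) < 1e-12
```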
Parameter estimation is straightforward, again due to the form of the basket model. Note that the formal structure of the market basket likelihood function is identical to a single category logit likelihood defined over 2[sup N] -- 1 = 15 possible alternatives. That is, given our basket model, we can approach model calibration as if the household made one selection out of 15 alternatives on each purchase occasion using the probabilities defined by Equations (13) and (6). Accordingly, model parameters were computed using a standard multinomial logit maximum likelihood estimation algorithm.(n2) The form in which explanatory variables enter the model follows the general structure of Equation (6).
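A minimal sketch of the likelihood contribution of one purchase occasion makes the point concrete (the basket labels and utility values are hypothetical; a real implementation would build Mu from Equation (6) and sum these terms over all occasions before maximizing):

```python
import math

def basket_log_likelihood(mu_by_basket, chosen):
    """Log likelihood of one purchase occasion under Equation (13): formally a
    multinomial logit over the 2^N - 1 non-null baskets (illustrative sketch)."""
    log_z = math.log(sum(math.exp(m) for m in mu_by_basket.values()))
    return mu_by_basket[chosen] - log_z

# Three observable baskets with hypothetical utilities Mu(b,k,t).
mu = {"P": 0.2, "T": 0.9, "PT": 0.4}
probs = {b: math.exp(basket_log_likelihood(mu, b)) for b in mu}
assert abs(sum(probs.values()) - 1.0) < 1e-12
assert max(probs, key=probs.get) == "T"  # highest-utility basket is most likely
```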
Model Comparison
Before interpretation of the multivariate logistic basket model is attempted, two key questions should be addressed. First, does the multivariate logistic basket model represent an improvement in forecasting ability relative to an analysis of each category separately? Second, is there any evidence that the multivariate logistic model is correctly specified? Both questions are important because they deal with the managerial usefulness of the results. To address these questions, we compare the performance of various types of market basket models with respect to the paper goods data (Table 4).
Two classes of models are considered: Multivariate Logistic models, and Benchmark models. The Multivariate Logistic models are different versions of the basket model proposed in this research. Model C1 (Independence) assumes that all cross effects theta[sub ijk] are zero. This model is equivalent to fitting a separate logit model for each of the four paper goods categories. Model C2 (Simple Cross) assumes that the cross effects theta[sub ijk] do not vary across households. Model C3 (Full Cross Effects) allows cross effects to vary with respect to basket size loyalty, as specified by Equation (11).
The Benchmark models do not follow the logic of the multivariate logistic model discussed earlier. Instead, they use different specifications of category price variables to add cross category effects to the model. Models B1 and B2 begin with the Independence model (C1) and add price terms. In model B1, the MIX variable in the conditional logit models [Equations (3) and (4)] is assumed to depend on the prices of all categories--not just the price of the given category. In model B2, Equation (11) is modified to
(14) theta[sub ijk] = tau[sub ij] log[PRICE[sub ikt]] log[PRICE[sub jkt]]
where tau[sub ij] is symmetrical in categories i and j. Models B3 and B4 are constructed in an analogous way, but use the Full Cross Effects model (C3) as the base. Model B3 adds cross price terms to the conditional logit models [Equations (3) and (4)]. Model B4 modifies Equation (11) to
(15) theta[sub ijk] = delta[sub ij] + phi SIZE[sub k] + tau[sub ij] log[PRICE[sub ikt]] log[PRICE[sub jkt]]
where delta[sub ij] and tau[sub ij] are symmetrical in categories i and j. In each case, Besag's (1974) theorem is used to derive a corresponding market basket model (in logit form), which is then estimated. Because each of these models allows cross-category effects to be represented in a different way, they serve as reference points to the proposed multivariate logistic model specification developed earlier.
In Table 4, we use both the Bayesian Information Criterion (BIC) and the log likelihood in the holdout data (HLL) to select the best model for the paper goods data. The BIC adjusts the log likelihood in the calibration data for the number of parameters estimated, whereas the HLL uses model forecasts to construct the log likelihood of the holdout data. Both criteria are commonly used in marketing science to select among competing models. The smallest BIC value and the largest HLL value identify the best model.
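For concreteness, a minimal sketch of the BIC computation under the sign convention used here (smaller is better); the log likelihood values are hypothetical, while 2,578 is the number of calibration baskets reported earlier:

```python
import math

def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: -2*LL + k*log(n); smaller is better."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)

# A richer model is preferred only if its fit improvement outweighs the
# parameter penalty (hypothetical log likelihoods, 2,578 calibration baskets):
assert bic(-4000.0, 20, 2578) < bic(-4100.0, 10, 2578)
```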
The clear conclusion is that the Full Cross Effects model (C3) best represents the choice process. Because the Full Cross Effects model is better than the Independence model (no cross effects), we can conclude cross-category information is important.(n3) That is, choices across categories are correlated. Moreover, the superiority of the Full Cross Effects model over all the Benchmark models provides strong evidence that the multivariate logistic approach discussed earlier is a reasonable way of capturing these cross-category effects. For this reason, the remainder of the discussion is focused on the Full Cross Effects model.
Parameter Estimates
The parameter estimates for the Full Cross Effects model are presented in Table 5. Setting aside the category-specific intercepts, all parameters are statistically significant and have the expected signs: negative for all price coefficients, and positive for coefficients corresponding to loyalty, time since last purchase, and basket size loyalty. It should be noted that the parameters in each column of the table correspond to one of the four conditional logit models defined by Equations (3) and (4). However, the parameters collectively define the multivariate logistic basket model of Equation (6). Note that we are able to obtain estimates of all parameters in the model, despite the lack of information on null baskets.
Interpretation of the demand relationships depicted by the cross effects is difficult because basket size loyalty varies across households. To gain insight into the cross-category effects, we present the cross effects for a typical household in Table 6. These values were computed by replacing SIZE[sub k] in Equation (11) by the value 1.54, the mean of SIZE[sub k] across all households. Three of the categories (paper towels, toilet tissue, and facial tissue) have positive cross effects and consequently act as demand complements. The clear outlier is paper napkins, which has a mixture of positive and negative relationships with respect to the remaining categories. This pattern is consistent with the fact that the paper napkin category has the longest interpurchase time of the four paper goods categories in this study (Table 2).
In reading Table 6, it is important to bear in mind that the magnitude of these effects differs across households. Because the coefficient on SIZE[sub k] is positive, households that tend to buy more categories per shopping trip on average will have larger coefficients and are more likely to exhibit complementarity in cross-category relationships. In contrast, households that buy fewer categories per shopping trip on average will have smaller coefficients and are more likely to exhibit substitution across categories. As we show subsequently, these same effects are evident when cross-category price elasticities are calculated.
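The sign flip described above follows directly from Equation (11). A minimal numeric sketch (Python, using the phi and paper towels/toilet tissue delta estimates from Table 5; the two basket-size values are hypothetical illustrations, not data):

```python
PHI = 0.6421           # basket size loyalty coefficient (Table 5)
DELTA_PT_TT = -0.6568  # paper towels / toilet tissue cross effect (Table 5)

def cross_effect(delta_ij, size_k, phi=PHI):
    # Equation (11): theta_ijk = delta_ij + phi * SIZE_k.
    # Positive values imply complementarity; negative values, substitution.
    return delta_ij + phi * size_k

# A household averaging 2.5 categories per trip sees the pair as complements...
assert cross_effect(DELTA_PT_TT, 2.5) > 0
# ...while a household averaging 0.5 categories per trip sees substitutes.
assert cross_effect(DELTA_PT_TT, 0.5) < 0
```

For this pair of categories the sign changes where SIZE[sub k] = 0.6568/0.6421, i.e., at roughly one category per trip.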
Cross-Category Price Elasticities
From a managerial perspective, the most interesting aspects of this research are cross-category price elasticities. In Table 7, we display the percentage change in the choice share of the row category with respect to a one percent change in the price of the column category. These elasticities are computed using the forecasts of the Full Cross Effects model and an elasticity formula developed in Appendix B. The elasticities take into account consumer heterogeneity and are computed with respect to the long-run choice shares of the entire market. Consequently, they can be interpreted as the pattern of cross-category price elasticities that a retailer would observe in a typical week.
Several aspects of Table 7 are noteworthy. First, as might be expected, own-price effects (the diagonal of the matrix) are less than one in absolute value--implying inelastic demand for the four paper goods categories. (In contrast, single-category studies of brand price competition typically show elastic demand.) Second, most of the cross elasticities are negative--implying complementarity. Again, the exception is the paper napkins category, which acts as a substitute with respect to paper towels and facial tissue. These cross elasticities are asymmetric (despite the symmetry imposed upon the cross-category coefficients theta[sub ijk]). In general, the properties of this elasticity matrix are reasonable given that our analysis examines choice across product categories.
The most striking aspect of this analysis is the small size of the cross-price effects. Although patterns of complementarity and substitution are present in Table 7, cross-category spillover effects due to price are not very important in terms of market-level category choice shares. However, recall that the Full Cross Effects model (used in computing the cross-category price elasticities) fits the data better than an Independence model, which assumes that no cross-category correlation in choice exists. Moreover, this improvement can be detected when forecasting to a holdout data period (Table 4). Taken together, these findings suggest that cross-category correlation in choice exists, but is due primarily to the consumer's shopping style. Consumers do buy paper goods categories together on the same shopping trip, but variation of price in one category has a modest impact upon sales in other paper goods categories.(n4)
Research Implications
It is important to place these conclusions in the proper context. The key attributes linking the four paper goods categories are proximity in store layout and cross-category brand names. These features are apt to be much weaker determinants of demand interdependence than true consumption complementarity. Earlier work by Bodapati and Srinivasan (1999), using a set of categories with no obvious consumption complementarities, also found weak cross-category demand effects. Indeed, the only market basket study to date that has found relatively large cross-category elasticities examined strong consumption complements such as cake mix and frosting (Manchanda et al., 1999). An emerging generalization may be that strong consumer perceptions of category relatedness are necessary before cross-category price spillover effects will be observed in a market basket context.
It is important to understand that market basket models are designed to forecast choice behavior within the grocery store. In effect, the consumer is assumed to be already in the store, and variables such as category price level are used to predict which basket of categories will be selected. Because market basket models do not address store traffic effects, the lack of strong cross-category price elasticities in this study should not be interpreted as evidence that category pricing activity has no impact on the types of baskets that the retailer will observe in a particular week. It is entirely possible that the major impact of category pricing is on store choice--an aspect of choice behavior that is not modeled in this research. Because there is strong evidence that preferences are correlated across product categories (Russell and Kamakura, 1997), cross-category demand effects may be largely determined by week-to-week fluctuations in the set of households buying in a particular store.
In fact, our data do provide an indirect indication that prices could influence store choice. In Table 8, we repeat the elasticity analysis of Table 7, but separate households into two groups, depending upon how many paper goods categories are purchased on a typical shopping trip. As might be expected, the small basket group shows more substitution across categories, than the large basket group. Again, these cross effects are small. The key finding is that own price effects are uniformly larger for the small basket group than for the large basket group. (Compare the diagonal terms of the two elasticity matrices in Table 8.) It may in fact be the case that paper goods promotions will differentially attract consumers who tend to buy smaller market baskets. This type of "cherry picking" behavior is not advantageous to the retailer, but may be a stable characteristic of consumer behavior. Bell and Lattin (1998) also provide evidence that small basket consumers exhibit higher price sensitivity with respect to category choice than large basket consumers.
This research develops a new approach to market basket construction based upon the notion that choice in one category impacts choices in other categories. The approach assumes that the researcher can specify the probability that a consumer chooses one category in the basket, given information on the actual choice outcomes in all other categories. We show that by using these conditional choice models, it is possible to infer the market basket distribution that explains purchasing in all categories. We applied the approach to the analysis of choice in four paper goods categories. Substantively, we showed that choice across four paper goods categories is correlated, but that the within-store magnitudes of cross-category price effects are modest.
Methodological Contribution
The model developed here has a number of advantages. Although the logic behind the model is consistent with the sequential choice approach of Harlam and Lodish (1995), it does not require the researcher to actually observe the order in which choices are made. Given that decision sequences are rarely recorded in consumer purchase histories, this feature gives the proposed approach much greater applicability. Moreover, the form of the model developed here is computationally very attractive. Because the multivariate logistic distribution shares certain similarities with logit choice models, estimation algorithms developed for single-category logit analysis can be readily adapted to the analysis of choice in multiple product categories. In effect, by thinking of the process as one of choosing baskets rather than individual categories, we are able to recast a multiple-category decision model into a single-category choice framework. The result is an analytically tractable market basket choice model that allows general patterns of choice dependence (complementarity, independence, or substitution).
Managerial Contribution
Despite the failure to detect strong cross price effects across the four paper goods categories, this model may nevertheless prove useful in studying the impact of cross category marketing activity for other groups of categories. If the proposed market basket analysis were conducted with a large number of categories, it would be possible to describe the retail category assortment in terms of the strength of cross-category relationships. Blocks of categories with strong complementarity relationships are particularly interesting because promotional activity in one category can simultaneously increase sales in other categories within the same block. If the retailer were to discover only weak cross-category effects, then market baskets can be forecasted using independent category-level choice models. Under this scenario, understanding the long-run preferences of consumers and predicting store choice among these consumers are more important in forecasting than modeling cross-category choice correlations at the time of purchase.
It is likely that cross-category price effects are strongly affected by perceived relatedness of product categories. Our inability to find strong cross-category effects for the paper goods categories provides evidence that cross-category branding and physical proximity in the store do not provide point-of-purchase cues that stimulate cross-category purchasing. Rather, true consumption complementarity (such as the joint usage of detergent and fabric softener) appears to be required to generate strong cross-category effects. However, retailers potentially can enhance perceptions of cross-category relatedness by using merchandising tools (point-of-purchase materials, cross-category coupons, and creative store layout) to suggest cross-category consumption goals to the consumer. The methodology developed in this research could be a useful tool in measuring the success of such retailer actions in building larger market baskets. Given the strong interest by retailers in category management, the development and assessment of marketing policies that promise inter-category synergies is clearly of interest.
Future Work
There are a number of limitations to the current model, all of which provide opportunities for future research. As noted earlier, the model developed here ignores store choice, assuming that the consumer is already in the store and ready to make purchases. The model also does not identify the particular product chosen when a category is selected nor provide an estimate of purchase volume. Both these issues can be addressed in the conditional choice framework, but would require the development of a nested probability structure using different types of conditional choice distributions. These extensions are potentially very important because they provide alternative perspectives on the market basket choice decision.
Finally, as the number of categories becomes large, the approach taken in our research will clearly become infeasible. A typical retailer in the United States carries approximately 31,000 items, divided into 600 product categories (Kahn and McAlister, 1997). Any attempt to build a choice model that explicitly enumerates the 2[sup 600] (approximately 10[sup 181]) possible baskets will obviously fail. However, another route is open to the researcher. It is possible to estimate each of the full conditional models individually (with side conditions to ensure mutual consistency) and then to use Markov Chain Monte Carlo simulation methodologies to forecast the full market basket distribution (see, e.g., Gilks et al., 1996; Gelman et al., 1996). This general procedure (which effectively reduces a 2[sup N]-sized problem to an N-sized problem) may allow the development of a practical forecasting tool for large market baskets.
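A sketch of that route, assuming logistic full conditionals with a symmetric cross-effect matrix (all parameter values below are hypothetical), is a standard Gibbs sampler that visits one category at a time and never enumerates the 2[sup N] baskets:

```python
import math
import random

def gibbs_baskets(beta, theta, n_draws=2000, burn=500, seed=1):
    """Sample market baskets from the full conditionals
    P(x_i = 1 | rest) = logistic(beta_i + sum_j theta_ij * x_j)."""
    n = len(beta)
    x = [0] * n
    rng = random.Random(seed)
    draws = []
    for sweep in range(burn + n_draws):
        for i in range(n):
            u = beta[i] + sum(theta[i][j] * x[j] for j in range(n) if j != i)
            x[i] = 1 if rng.random() < 1.0 / (1.0 + math.exp(-u)) else 0
        if sweep >= burn:
            draws.append(tuple(x))
    return draws

# Hypothetical 3-category example with a symmetric cross-effect matrix
beta = [-1.0, -0.5, -1.5]
theta = [[0.0, 0.4, 0.0],
         [0.4, 0.0, -0.3],
         [0.0, -0.3, 0.0]]
draws = gibbs_baskets(beta, theta)
shares = [sum(d[i] for d in draws) / len(draws) for i in range(3)]
```

Each sweep costs on the order of N[sup 2] operations rather than 2[sup N], and the retained draws approximate the implied joint basket distribution, from which category choice shares and basket frequencies can be forecast.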
Acknowledgment: The authors thank Professor Andrew Mitchell, Director of the Canadian Centre for Marketing Information Technologies, for providing access to the data used in this research. The authors also thank Greg Allenby, V. Srinivasan, Osnat Stamer and Doyle Weiss for many helpful comments. This research was supported by the College of Business Summer Grant Program and by a special grant from Mr. Robert Jensen to the College of Business.
(n1.) To emphasize the simplicity of model calibration for the applied researcher, we capture all consumer heterogeneity using observed variables for category loyalty and interpurchase time. More sophisticated statistical approaches--such as latent class models (Kamakura and Russell, 1989) and random coefficient models (Jain, Vilcassim and Chintagunta, 1994)--would be required to represent unobserved consumer heterogeneity.
(n2.) The tractability of the multivariate logistic distribution is a key feature of the methodology developed here. In contrast, a market basket model based upon the multivariate probit (Manchanda et al., 1999) requires the use of Markov Chain Monte Carlo simulation techniques to evaluate the high dimensional integral defining purchase probability.
(n3.) The superiority of the Full Cross Effects model over the Independence model can also be shown using the classical likelihood ratio test.
(n4.) The importance of the cross-category effects can be measured using the rho[sup 2] statistic. This statistic is defined as [LL(base) - LL(m)]/LL(base), where LL(m) is the log likelihood of the model under consideration and LL(base) is the log likelihood of a base model which assumes that all baskets are equally likely. The value of rho[sup 2] runs between zero and one, with higher values indicating better fit. Using information on fit in the holdout data sample, we find the following values: Independence = .316, Simple Cross Effects = .324 and Full Cross Effects = .327. Evidently, cross-category effects are present in paper goods basket choice behavior, but the incremental improvement in fit due to these effects is small.
Table 1. Values of mu(b,k,t) for Two Categories

                                     Category 1 present        Category 1 absent
                                     in basket (X1 = 1)        from basket (X1 = 0)

Category 2 present                   beta[sub 1] + MIX[sub 1]  beta[sub 2] + MIX[sub 2]
in basket (X2 = 1)                   + beta[sub 2] + MIX[sub 2]
                                     + theta[sub 12]

Category 2 absent                    beta[sub 1] + MIX[sub 1]  0
from basket (X2 = 0)

Note: For expositional reasons, this table ignores consumer-specific effects. The probability of selecting a given market basket b equals exp[mu(b,k,t)]/{SIGMA[sub b*] exp[mu(b[sup *],k,t)]}, where the summation runs over all possible baskets b[sup *].
Table 2. Descriptive Statistics (standard deviations in parentheses)

                                   Paper      Toilet     Facial     Paper
                                   Towels     Tissue     Tissue     Napkins
Calibration Data
 Category loyalty                  -.7555     -.9913     -.8054     -1.941
                                   (.4486)    (.5794)    (.5296)    (.7332)
 Time since last purchase          1.313      1.406      1.408      1.838
  (log weeks)                      (.6643)    (.7354)    (.7015)    (.7703)
 Category price index              1.783      1.187      .7163      2.006
  (dollars per equivalent unit)    (.1960)    (.2941)    (.0768)    (.1476)
Holdout Data
 Category loyalty                  -.7458     -1.177     -.8141     -2.429
                                   (.4214)    (.6717)    (.4736)    (.9542)
 Time since last purchase          1.434      1.805      1.531      2.593
  (log weeks)                      (.7160)    (.7126)    (.7704)    (.8474)
 Category price index              1.723      1.133      .691       2.037
  (dollars per equivalent unit)    (.1894)    (.2804)    (.0711)    (.1293)

Note: Calibration data consist of 2,578 baskets from 169 households over a 41-week period. Holdout data consist of 822 baskets from 108 households over a subsequent 30-week period. Differences in category loyalty across these data periods are due to the fact that 61 households have no data during the holdout weeks. Standard deviations are shown in parentheses. See text for variable definitions.
Table 3. Market Basket Composition and Frequency

Basket   Paper    Toilet   Facial   Paper     Basket   Calibration Data   Holdout Data
Code     Towels   Paper    Tissue   Napkins   Size     (% of baskets)     (% of baskets)
P        1        0        0        0         1        20.8               28.7
T        0        1        0        0         1        16.7               14.6
F        0        0        1        0         1        17.6               21.5
N        0        0        0        1         1         5.7                3.9
PT       1        1        0        0         2         8.7                6.0
PF       1        0        1        0         2         7.5               11.0
PN       1        0        0        1         2         1.7                1.1
TF       0        1        1        0         2         6.7                5.0
TN       0        1        0        1         2         2.1                0.9
FN       0        0        1        1         2         1.4                0.9
PTF      1        1        1        0         3         6.2                4.3
PTN      1        1        0        1         3         1.8                0.9
PFN      1        0        1        1         3         0.8                0.5
TFN      0        1        1        1         3         1.2                0.2
PTFN     1        1        1        1         4         1.3                0.7
Total number of baskets                                2578                822

Note: P = paper towels, T = toilet paper, F = facial tissue, and N = paper napkins. Basket size is the total number of categories in the basket. The null basket (no categories purchased) is not included because null basket purchases were not recorded in the data collection process. Basket codes correspond to Figure 1.
Table 4. Model Comparison

Code   Model Description                                      Parm   Calibration LL   BIC            Holdout LL

Multivariate Logistic Models
C1     Independence (no cross effects)                        16     -5,170.64        10,466.96      -1,521.98
C2     Simple cross effects (delta[sub ij] only)              22     -5,124.49        10,421.78      -1,505.23
C3     Full cross effects (phi SIZE[sub k] + delta[sub ij])   23     -5,052.61        10,285.88(*)   -1,498.74(*)

Benchmark Models
B1     Model C1 + prices of all categories in main effects    28     -5,158.31        10,536.55      -1,528.44
B2     Model C1 + prices of all categories in cross effects   18     -5,176.71        10,389.42      -1,523.18
B3     Model C3 + prices of all categories in main effects    35     -5,042.54        10,360.00      -1,505.73
B4     Model C3 + prices of all categories in cross effects   29     -5,045.67        10,319.13      -1,502.31

Note: Parm is the number of parameters. Asterisk denotes the best model according to the Bayesian Information Criterion (BIC) and Holdout Log Likelihood (HLL). The BIC is based upon the fit to the calibration data, while the HLL is based upon the fit to the holdout data. For the BIC, the best model has the smallest BIC value. For the HLL, the best model has the largest HLL value.
Table 5. Parameter Estimates for the Full Cross Effects Model

                                   Paper      Toilet     Facial     Paper
                                   Towels     Tissue     Tissue     Napkins
Base Level Parameters
 Intercept                         .4750(b)   .3100(b)   -.4115(b)  .2500
                                   (.2734)    (.1683)    (.2139)    (.6424)
 Category loyalty                  1.834(a)   1.667(a)   1.540(a)   1.849(a)
                                   (.1305)    (.1108)    (.1086)    (.1383)
 Time since last purchase          .3097(a)   .1837(a)   .1694(a)   .8843(a)
                                   (.0701)    (.0712)    (.0672)    (.0946)
 Category price index              -.7300(a)  -.5485(a)  -.9571(a)  -1.240(b)
                                   (.3761)    (.2014)    (.4123)    (.7360)
Cross Effect Parameters (phi and delta[sub ij])
 Basket size loyalty (phi)         .6421(a)   .6421(a)   .6421(a)   .6421(a)
                                   (.0592)    (.0592)    (.0592)    (.0592)
 Paper towels delta[sub ij]        --         -.6568(a)  -.8635(a)  -1.174(a)
                                              (.1807)    (.1800)    (.1970)
 Toilet tissue delta[sub ij]       -.6568(a)  --         -.7633(a)  -.7249(a)
                                   (.1807)               (.1787)    (.1961)
 Facial tissue delta[sub ij]       -.8635(a)  -.7633(a)  --         -1.373(a)
                                   (.1800)    (.1787)               (.1990)
 Paper napkins delta[sub ij]       -1.174(a)  -.7249(a)  -1.373(a)  --
                                   (.1970)    (.1961)    (.1990)

Note: The basket size loyalty coefficient is not category specific; only one coefficient is estimated for the model. The delta[sub ij] cross effect parameters are constrained to be symmetrical. Standard errors of parameters are shown in parentheses. Statistical significance is denoted as (a) (.05 level or better) and as (b) (.10 level or better).
Table 6. Cross Effects theta[sub ijk] for a Typical Household

                   Paper      Toilet     Facial     Paper
                   Towels     Tissue     Tissue     Napkins
Paper towels       --         .3192(a)   .1125      -.1975
                              (.1205)    (.1200)    (.1408)
Toilet tissue      .3192(a)   --         .2128(b)   .2511(b)
                   (.1205)               (.1183)    (.1413)
Facial tissue      .1125      .2128(b)   --         -.3972(a)
                   (.1200)    (.1183)               (.1442)
Paper napkins      -.1975     .2511(b)   -.3972(a)  --
                   (.1408)    (.1413)    (.1442)

Note: Values shown are theta[sub ijk] = delta[sub ij] + phi SIZE[sub k], where SIZE[sub k] is set to the mean number of categories per trip across all households (SIZE[sub k] = 1.54). The standard errors shown in parentheses are inferred from the results reported in Table 5. Statistical significance is denoted as (a) (.05 level or better) and as (b) (.10 level or better).
Table 7. Aggregate Cross-Category Price Elasticities

                   Paper     Toilet    Facial    Paper
                   Towels    Tissue    Tissue    Napkins
Paper towels       -.416     -.021     -.018      .005
Toilet tissue      -.031     -.308     -.027     -.019
Facial tissue      -.016     -.017     -.581      .018
Paper napkins       .009     -.024      .036     -.939

Note: Matrix displays the percentage change in the choice share of the row category with respect to a one percent increase in the price of the column category. Values in the table were computed using the parameters of Table 5 and the aggregate elasticity expressions in Appendix B.
Table 8. Aggregate Price Elasticities by Basket Size Segment

                   Paper     Toilet    Facial    Paper
                   Towels    Tissue    Tissue    Napkins
Small Basket Consumers
Paper towels       -.505     -.008      .007      .021
Toilet tissue      -.012     -.386      .004      .004
Facial tissue       .006     -.002     -.683      .034
Paper napkins       .036     -.005      .068     -1.034
Large Basket Consumers
Paper towels       -.374     -.028     -.030      .002
Toilet tissue      -.038     -.275     -.037      .026
Facial tissue      -.026     -.023     -.531      .010
Paper napkins      -.004     -.033      .020     -.893

Note: Matrices display the percentage change in the aggregate choice share of the row category with respect to a one percent increase in the price of the column category. The median household buys 1.45 paper goods categories per trip. Households below the median are classified as "Small Basket Consumers." Households above the median are classified as "Large Basket Consumers."
Bell, David R. and James M. Lattin (1998), "Shopping Behavior and Consumer Response to Retail Price Format: Why 'Large Basket' Shoppers Prefer EDLP," Marketing Science, 17(1), 66-88.
Besag, Julian (1974), "Spatial Interaction and the Statistical Analysis of Lattice Systems," Journal of the Royal Statistical Society B, 36, 192-236.
Blattberg, Robert C. (1989), "Learning How the Market Works," Pp. 13-16 in Bruce Weinberg (Ed.), Building an Information Strategy for Scanner Data, Report 89-121, Cambridge, MA: Marketing Science Institute.
Blattberg, Robert C. and Scott A. Neslin (1990), Sales Promotion: Concepts, Methods and Strategies, Englewood Cliffs, NJ: Prentice-Hall.
Bodapati, Anand V. and V. Srinivasan (1999), "The Impact of Out-of-Store Advertising on Store Sales," Working Paper, Northwestern University.
Brand, Estelle and Rob Gerritsen (1998), "Association and Sequencing," DBMS Data Mining Solutions Supplement, July, http://www.dbmsmag.com/9807m03.html.
Bucklin, Randolph E., Gary J. Russell, and V. Srinivasan (1998), "A Relationship Between Price Elasticities and Brand Switching Probabilities," Journal of Marketing Research, 35(February), 99-113.
Catalina Marketing Corporation (1997). "Checkout Coupon: Now, Customize Incentives Based on Actual Purchase Behavior," Catalina Marketing Corporate Homepage on the World Wide Web, http: \\www.catmktg.com\rodcpn.htm.
Cooper, Lee G. (1988), "Competitive Maps: The Structure Underlying Asymmetric Cross Elasticities," Management Science, 34(June) 707-723.
Cox, D. R. (1972), "The Analysis of Multivariate Binary Data," Applied Statistics (Journal of the Royal Statistical Society, Series C), 21(2), 113-120.
Cressie, Noel A. C. (1993), Statistics for Spatial Data, New York: John Wiley and Sons.
Erdem, Tulin (1998), "An Empirical Analysis of Umbrella Branding," Journal of Marketing Research, 35(August), 339-351.
Farquhar, Peter H. and Vithala R. Rao (1976), "A Balance Model for Evaluating Subsets of Multiattributed Items," Management Science, 27(5), 528-539.
Gelman, Andrew, John B. Carlin, Hal S. Stern, and Donald B. Rubin (1996), Bayesian Data Analysis, London: Chapman and Hall.
Gilks, W. R., S. Richardson, and D. J. Spiegelhalter (1996), Markov Chain Monte Carlo in Practice, London: Chapman and Hall.
Grover, Rajiv and V. Srinivasan (1992), "Evaluating the Multiple Effects of Retail Promotions on Brand Loyal and Brand Switching Segments," Journal of Marketing Research, 29(February), 76-89.
Gupta, Sunil (1988), "Impact of Sales Promotions on When, What and How Much to Buy," Journal of Marketing Research, 25, 342-355.
Harlam, Bari A. and Leonard M. Lodish (1995), "Modeling Consumers' Choices of Multiple Items," Journal of Marketing Research, 32(November), 392-403.
Jain, Dipak, Naufel Vilcassim, and Pradeep Chintagunta (1994), "A Random Coefficients Logit Brand-Choice Model Applied to Panel Data," Journal of Business and Economic Statistics, 12(3), 317-328.
Kahn, Barbara E. and Leigh McAlister (1997), Grocery Revolution: The New Focus on the Consumer, Reading, MA: Addison-Wesley.
Kamakura, Wagner A. and Gary J. Russell (1989), "A Probabilistic Choice Model for Market Segmentation and Elasticity Structure," Journal of Marketing Research, 26(November), 379-390.
Kamakura, Wagner A., Sridhar Ramaswami, and Rajendra K. Srivastava (1991), "Qualification of Prospects for Cross-Selling in the Financial Services Industry," International Journal of Research in Marketing, 8, 329-349.
Manchanda, Puneet, Asim Ansari, and Sunil Gupta (1999), "The 'Shopping Basket:' A Model for Multi-Category Purchase Incidence Decisions," Marketing Science, 18(2), 95-114.
McAlister, Leigh (1979), "Choosing Multiple Items from a Product Class," Journal of Consumer Research, 6(December), 213-224.
Russell, Gary J. and Wagner A. Kamakura (1994), "Understanding Brand Price Competition with Micro and Macro Scanner Data," Journal of Marketing Research, 31(May), 289-303.
Russell, Gary J. and Wagner A. Kamakura (1997), "Modeling Multiple Category Brand Preference with Household Basket Data," Journal of Retailing, 73(Winter), 439-461.
Russell, Gary J., S. Ratneshwar, Alan Shocker et al. (1999), "Multiple Category Decision Making: Review and Synthesis," Marketing Letters, 10(July), 317-330.
The factorization theorem of Besag (1974) allows the researcher to verify consistency of the full conditionals and to derive the form of the implied joint distribution. Let X = {x(1),x(2),...,x(N)} be any basket of category choices. Let f(X) denote the joint distribution of the random variables x(1),x(2),...,x(N). (This can be interpreted as the probability of observing a basket with contents X.) Define the vector 0 = {0,0,...,0} as the null basket and let f(0) be the probability associated with the null basket. For any permutation of the category labels, the joint (market basket) distribution is implicitly defined by
(A1) f(X)/f(0) = k(1) * k(2) * ... * k(N)
where k(i) = g[sup i](x(i))/g[sup i](0) depends upon the full conditional distributions
(A2) g[sup i](.) = f(.|x(1),...,x(i - 1), 0,...,0).
An explicit expression for f(X) can be worked out using the fact that the summation of (A1) over all market baskets X is equal to 1/f(0).
Because the order in which the categories are arranged in X is arbitrary, the joint distribution will not be unique unless all permutations of the category labels yield the same joint distribution according to Equations (A1) and (A2). As Besag (1974) notes, this generally requires the researcher to place restrictions on the form of the full conditional distributions. In the present application, symmetry of the cross effects theta[sub ij] ensures that the form of the joint distribution is invariant with respect to label permutations.
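The invariance claim is easy to verify numerically. The sketch below (Python, with assumed beta and symmetric theta values for three categories) builds f(X)/f(0) along every label ordering per (A1)-(A2) and checks that all orderings yield the same value, equal to the unnormalized joint exp[mu(X)]:

```python
import math
from itertools import permutations

beta = [0.2, -0.4, 0.1]                            # assumed main effects
theta = {(0, 1): 0.3, (0, 2): -0.2, (1, 2): 0.5}   # symmetric cross effects

def th(i, j):
    return theta[(min(i, j), max(i, j))]

def mu(x):
    # Log of the unnormalized joint:
    # sum_i beta_i x_i + sum_{i<j} theta_ij x_i x_j
    n = len(x)
    return (sum(beta[i] * x[i] for i in range(n)) +
            sum(th(i, j) * x[i] * x[j]
                for i in range(n) for j in range(i + 1, n)))

def besag_ratio(x, order):
    # f(X)/f(0) = k(1) * k(2) * ... * k(N) built along `order`; each
    # factor conditions on the coordinates already visited, as in (A1)-(A2)
    log_ratio, seen = 0.0, []
    for i in order:
        log_ratio += x[i] * (beta[i] + sum(th(i, j) for j in seen if x[j]))
        seen.append(i)
    return math.exp(log_ratio)

x = (1, 0, 1)
ratios = {round(besag_ratio(x, p), 10) for p in permutations(range(3))}
assert len(ratios) == 1                            # invariant to labeling
assert abs(besag_ratio(x, (0, 1, 2)) - math.exp(mu(x))) < 1e-12
```

Breaking the symmetry of theta (for example, using different values for theta[sub 12] and theta[sub 21]) makes the set of ratios contain more than one value, illustrating why the symmetry restriction is needed.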
Expressions for price elasticities depend upon the specification of the market basket model. Following the form of the cross-effects model discussed in the text, we assume that the probability that consumer k buys basket b at time t is given by
(B1) Pr(B(k,t) = b) = exp{Mu(b,k,t)} / SIGMA[sub b*] exp{Mu(b[sup *],k,t)}
where
(B2) Mu(b,k,t) = SIGMA[sub i] beta[sub i]X(i,b) + SIGMA[sub i] HH[sub ikt]X(i,b) + SIGMA[sub i] MIX[sub ikt]X(i,b) + SIGMA[sub i<j] theta[sub ijk]X(i,b)X(j,b)
is the implied utility of a basket with contents {X(1,b), X(2,b),...,X(N,b)}. Here X(i,b) is a binary 0-1 variable that takes the value 1 when category i is present in basket b. We assume that price enters into the utility expression (B2) only as MIX[sub ikt] = gamma[sub i] log[PRICE[sub ikt]]. All other terms in (B2)--including the symmetrical cross effects theta[sub ijk] = phi SIZE[sub k] + delta[sub ij]--do not depend on price.
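Because the basket space is small here (2[sup 4] = 16 baskets), (B1)-(B2) can be evaluated by direct enumeration. A minimal sketch (Python; the utilities and cross effects below are hypothetical, with all non-cross terms for each category collapsed into a single net utility u_i):

```python
import math
from itertools import product

u = [0.3, -0.1, -0.8]                                # net utility per category
theta = {(0, 1): 0.35, (0, 2): 0.10, (1, 2): -0.25}  # symmetric cross effects

def mu_basket(x):
    # (B2): main effects plus pairwise cross effects for basket contents x
    n = len(x)
    return (sum(u[i] * x[i] for i in range(n)) +
            sum(theta[(i, j)] * x[i] * x[j]
                for i in range(n) for j in range(i + 1, n)))

baskets = list(product([0, 1], repeat=len(u)))       # includes the null basket
z = sum(math.exp(mu_basket(b)) for b in baskets)     # normalizer in (B1)
prob = {b: math.exp(mu_basket(b)) / z for b in baskets}

# (B3): probability that category 1 appears in the chosen basket
delta1 = sum(p for b, p in prob.items() if b[0] == 1)
assert abs(sum(prob.values()) - 1.0) < 1e-12         # proper distribution
```

Because Mu of the null basket is zero, its probability is simply 1/z, which is how the model accommodates trips on which no category is purchased.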
Consumer-Level Price Elasticities
We consider summations of exp{Mu(b,k,t)} over various subsets of market baskets. The notation SB(all)[sub kt] = SIGMA[sub b] exp{Mu(b,k,t)} denotes a summation over all possible baskets (including the null basket). SB(i)[sub kt] denotes the summation of exp{Mu(b,k,t)} over all baskets b containing category i. In addition, SB(i,j)[sub kt] denotes the summation of exp{Mu(b,k,t)} over all baskets b containing both category i and category j.
It is important to understand that price elasticities are defined relative to product categories, not with respect to market baskets. Define the probability of a consumer buying category i on a shopping trip as
(B3) DELTA (i)[sub kt] = SB(i)[sub kt]/SB(all)[sub kt]
Formally, this is the probability that the consumer chooses a basket that contains category i, regardless of which additional categories are present. Analogously, the probability that the consumer chooses a basket containing both category i and category j is
(B4) DELTA(i,j)[sub kt] = SB(i,j)[sub kt]/SB(all)[sub kt]
In words, this is the probability that the selected basket contains both i and j, regardless of which additional categories are present. These expressions emphasize the fact that the basket model can be thought of as a probability distribution over product category choices--not just a probability distribution over market baskets.
Using these expressions, we define the cross-category price elasticity E(i,j)[sub kt] as the percentage change in the probability of selecting category i with respect to a one percent change in the price of category j. That is, E(i,j)[sub kt] = Differential(log DELTA(i)[sub kt])/Differential(log PRICE[sub jkt]). Using this definition, Equations (B1) through (B4) imply that
(B5) E(i,i)[sub kt] = gamma[sub i](1 - DELTA(i)[sub kt])
(B6) E(i,j)[sub kt] = gamma[sub j]DELTA(j)[sub kt][S(i,j)[sub kt] - 1], i is not equal to j
where S(i,j)[sub kt] = DELTA(i,j)[sub kt]/[DELTA(i)[sub kt]DELTA(j)[sub kt]] is a measure of the association across the two product categories. In these expressions, we expect gamma[sub i] and gamma[sub j] to be negative. Accordingly, own elasticities are always negative. Cross elasticities can be positive or negative, depending upon the value of [S(i,j)[sub kt] - 1].
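These closed forms can be checked against direct numerical derivatives of the basket model. The sketch below (Python, two categories, hypothetical parameters) enumerates baskets to obtain the DELTA terms of (B3)-(B4) and confirms (B5) and (B6):

```python
import math
from itertools import product

gamma = [-0.8, -1.2]   # price coefficients in MIX_i = gamma_i log(PRICE_i)
alpha = [0.5, 0.2]     # remaining non-price utility per category (assumed)
theta12 = 0.3          # symmetric cross effect
prices = [1.10, 0.95]

def deltas(prices):
    """DELTA(1), DELTA(2), DELTA(1,2) via basket enumeration, per (B3)-(B4)."""
    def mu(x):
        m = sum((alpha[i] + gamma[i] * math.log(prices[i])) * x[i]
                for i in range(2))
        return m + theta12 * x[0] * x[1]
    w = {b: math.exp(mu(b)) for b in product([0, 1], repeat=2)}
    z = sum(w.values())
    return ((w[(1, 0)] + w[(1, 1)]) / z,
            (w[(0, 1)] + w[(1, 1)]) / z,
            w[(1, 1)] / z)

d1, d2, d12 = deltas(prices)
own = gamma[0] * (1.0 - d1)              # (B5): own elasticity E(1,1)
s12 = d12 / (d1 * d2)                    # association measure S(1,2)
cross = gamma[1] * d2 * (s12 - 1.0)      # (B6): cross elasticity E(1,2)

# Numerical log-derivatives of DELTA(1) with respect to each log price
# agree with the closed forms
eps = 1e-6
d1_own, _, _ = deltas([prices[0] * math.exp(eps), prices[1]])
d1_cross, _, _ = deltas([prices[0], prices[1] * math.exp(eps)])
assert abs((math.log(d1_own) - math.log(d1)) / eps - own) < 1e-4
assert abs((math.log(d1_cross) - math.log(d1)) / eps - cross) < 1e-4
assert own < 0                           # own-price effect is negative
```

With the positive cross effect assumed here, S(1,2) exceeds one and the cross elasticity comes out negative, i.e., the two categories behave as complements.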
Aggregate Elasticities
To obtain aggregate price elasticities, we define the aggregate choice share of category i as the mean DELTA(i)[sub t] = [SIGMA[sub k] DELTA(i)[sub kt]]/N, where N is the total number of households. This expression is the expected choice share for category i in week t across the entire market. We also define the aggregate cross-category price elasticity as E(i,j)[sub t] = Differential(log DELTA(i)[sub t])/Differential(log PRICE[sub jt]), where PRICE[sub jt] is interpreted as the price of category j in week t for each consumer in the market. Using Equations (B5) and (B6), we can derive the aggregate cross-price elasticities
(B7) E(i,i)[sub t] = gamma[sub i][SIGMA[sub k] DELTA(i)[sub kt](1 - DELTA(i)[sub kt])]/[SIGMA[sub k] DELTA(i)[sub kt]]
(B8) E(i,j)[sub t] = gamma[sub j][SIGMA[sub k] {DELTA(i,j)[sub kt] - DELTA(i)[sub kt]DELTA(j)[sub kt]}]/[SIGMA[sub k] DELTA(i)[sub kt]], i is not equal to j
for a particular time point t. The differences between these expressions and the individual price elasticities in (B5) and (B6) are due to aggregation over heterogeneous consumers.
An alternative procedure is to define elasticities with respect to the overall choice shares DELTA(i) = [SIGMA[sub k]SIGMA[sub t] DELTA(i)[sub kt]]/N[sup *], where N[sup *] is the number of choice sets found in a particular data set. (N[sup *] equals N times the average number of choice occasions per household.) These elasticities can be regarded as the typical aggregate elasticities that would be found in the market during a randomly selected week. Using this definition and Equations (B5) and (B6), we obtain the aggregate cross-price elasticities
(B9) E(i,i) = gamma[sub i][SIGMA[sub k]SIGMA[sub t] DELTA(i)[sub kt](1 - DELTA(i)[sub kt])]/[SIGMA[sub k]SIGMA[sub t] DELTA(i)[sub kt]]
(B10) E(i,j) = gamma[sub j][SIGMA[sub k]SIGMA[sub t] {DELTA(i,j)[sub kt] - DELTA(i)[sub kt]DELTA(j)[sub kt]}]/[SIGMA[sub k]SIGMA[sub t] DELTA(i)[sub kt]], i is not equal to j
In the text, we report these aggregate elasticities in the discussion of cross-category price effects. It should be noted that if all cross-effect parameters theta[sub ijk] in (B2) are zero, then all cross-price elasticities in (B10) will also be zero.
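A sketch of how the overall aggregate elasticities (B9) and (B10) might be computed from individual-level choice probabilities. The data layout (dictionaries keyed by category i, household k, and week t) and the function name are assumptions for illustration, not part of the article:

```python
def aggregate_elasticities(delta, delta2, gamma):
    """Overall aggregate price elasticities as in (B9)-(B10), pooling all
    household-week pairs.

    delta  : dict {(i, k, t): DELTA(i)_kt}, category-choice probabilities
    delta2 : dict {(i, j, k, t): DELTA(i,j)_kt} with i < j, joint probabilities
    gamma  : dict {i: gamma_i}, own-price coefficients
    """
    cats = sorted(gamma)
    pairs = sorted({(k, t) for (_, k, t) in delta})
    # denominator: total choice share SIGMA_k SIGMA_t DELTA(i)_kt
    share = {i: sum(delta[(i, k, t)] for k, t in pairs) for i in cats}
    E = {}
    for i in cats:
        own = sum(delta[(i, k, t)] * (1 - delta[(i, k, t)]) for k, t in pairs)
        E[(i, i)] = gamma[i] * own / share[i]                      # (B9)
        for j in cats:
            if i != j:
                lo, hi = min(i, j), max(i, j)
                cross = sum(delta2[(lo, hi, k, t)]
                            - delta[(i, k, t)] * delta[(j, k, t)]
                            for k, t in pairs)
                E[(i, j)] = gamma[j] * cross / share[i]            # (B10)
    return E
```

Note that if every joint probability equals the product of the marginals (no association), the (B10) numerator vanishes and all cross elasticities are zero, matching the remark about zero cross-effect parameters.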
The aggregation procedures adopted here are similar in spirit to procedures advocated by Russell and Kamakura (1994) and Bucklin, Russell, and Srinivasan (1998). However, the expressions shown here are specific to the market basket model in (B1) and (B2).
~~~~~~~~
By Gary J. Russell, University of Iowa and Ann Petersen, Drake University
Gary J. Russell is Professor of Marketing at the Tippie College of Business,
University of Iowa, Iowa City, Iowa 52242-1000, (e-mail:
gary-j-russell@uiowa.edu). Ann Petersen is Visiting Assistant Professor, College
of Business, Drake University, Des Moines, Iowa 50311-4505, (email:
apetersn@blue.weeg.uiowa.edu).
Title: | Case Studies in Bayesian Statistics, Vol. IV (Book Review). |
Subject(s): | |
Source: | |
Abstract: | Reviews the book `Case Studies in Bayesian Statistics,' vol. 4, edited by Constantine Gatsonis, Robert E. Kass, Bradley Carlin, Alicia Carriquiry, Andrew Gelman, Isabella Verdinelli and Mike West. |
AN: | 3409007 |
ISSN: | 0040-1706 |
Database: | Business Source Premier |
edited by Constantine GATSONIS, Robert E. KASS, Bradley CARLIN, Alicia CARRIQUIRY, Andrew GELMAN, Isabella VERDINELLI, and Mike WEST, New York: Springer-Verlag, 1999, ISBN 0-387-98640-5, xiii + 427 pp., $44.95.
Collections of case studies in the physical and engineering sciences are rarely published. One recent example is that of Peck (1998), reported by Ziegel (1999). Here is a proceedings volume from the fourth in a series of workshops on case studies in Bayesian statistics. For various reasons, workshop participants are mostly academicians, and the case studies reflect their work, not industrial practices. This particular workshop was held at Carnegie Mellon University. The previous workshop and its proceedings volume, Gatsonis et al. (1997), were reported by Ziegel (1998).
The case studies in this book do not enhance the available literature for the physical sciences because most of these applications are biomedical. There are four invited papers with multiple discussants. These consume two-thirds of the book. The first, "Modeling Customer Survey Data," is a nonbiomedical paper from Lucent Technologies. The topics for the other three case studies are analysis of spatio-temporal patterns in neuroimaging, modeling genetic testing data for susceptibility to breast cancer, and characterizing variability in absorption for drug development.
The remaining nine papers were chosen from among the contributed poster presentations. Seven of these are biomedical applications. The other two case studies are environmental applications. The first involves a model for evaluating designs for the placement of rainfall monitoring stations. The other presents the use of a spatial model for wind data.
Much of the literature on Bayesian applications appears in various collections of papers. Another similar series is most recently represented by Bernardo, Berger, Dawid, and Smith (1996). See Gelman, Carlin, Stern, and Rubin (1995) for a recent Bayesian textbook.
Bernardo, J., Berger, J., Dawid, A., and Smith, A. (eds.) (1996), Bayesian Statistics 5, New York: Oxford University Press.
Gatsonis, C., Hodges, J., Kass, R., McCulloch, R., Rossi, P., and Singpurwalla, N. (eds.) (1997), Case Studies in Bayesian Statistics, Volume III, New York: Springer-Verlag.
Gelman, A., Carlin, J., Stern, H., and Rubin, D. (1995), Bayesian Data Analysis, London: Chapman and Hall.
Peck, R., Haugh, L., and Goodman, A. (1998), Statistical Case Studies, Philadelphia: ASA-SIAM.
Ziegel, E. (1998), Editor's Report on Case Studies in Bayesian Statistics, Volume III, by C. Gatsonis, J. Hodges, R. Kass, R. McCulloch, P. Rossi, and N. Singpurwalla, Technometrics, 40, 84.
----- (1999), Editor's Report on Statistical Case Studies, by R. Peck, L. Haugh, and A. Goodman, Technometrics, 41, 382.
Books listed here have been assigned for review in the past quarter. Publication of their reviews or reports generally would occur within the next four issues of the journal. Persons interested in reviewing specific books must notify the editor by the publication date for the book. Persons interested in being reviewers should contact the editor by electronic mail (ziegeler@bp.com).
The Analysis of Variance, by Hardeo Sahai and Mohammed I. Ageel, Birkhauser
Applied Mixed Models in Medicine, by Helen Brown and Robin Prescott, Wiley
Basic Linear Geostatistics, by Margaret Armstrong, Springer-Verlag
The Basic Practice of Statistics (2nd ed.), by David S. Moore, W. H. Freeman
Bayesian Inference in Wavelet-Based Models, edited by Peter Muller and Brani Vidakovic, Springer-Verlag
Bootstrap Methods, by Michael R. Chernick, Wiley
Chance Encounters, by Christopher J. Wild and George A. F. Seber, Wiley
Chance Rules, by Brian S. Everitt, Springer-Verlag
Comparative Statistical Inference (3rd ed.), by Vic Barnett, Wiley
The Complete Guide to Six Sigma, by Thomas Pyzdek, Quality Publishing
Computer-Assisted Analysis of Mixtures and Applications, by Dankmar Bohning, Chapman & Hall/CRC
Conditional Specification of Statistical Models, by Barry C. Arnold, Enrique Castillo, and Jose Maria Sarabia, Springer-Verlag
The Desk Reference of Statistical Quality Methods, by Mark L. Crossley, ASQ Quality Press
Discrete-time Dynamic Models, by Ronald K. Pearson, Oxford University Press
Doing Statistics for Business With Excel, by Marilyn K. Pelosi and Theresa M. Sandifer, Wiley
Essential Wavelets for Statistical Applications and Data Analysis, by R. Todd Ogden, Birkhauser Boston
Flood Frequency Analysis, by A. Ramachandra Rao and Khaled H. Hamed, CRC Press
Geostatistics for Engineers and Earth Scientists, by Ricardo A. Olea, Kluwer
The Grammar of Graphics, by Leland Wilkinson, Springer-Verlag
Improving Performance Through Statistical Thinking, by Galen C. Britz, Donald W. Emerling, Lynne B. Hare, Roger W. Hoerl, Stuart J. Janis, and Janice E. Shade, ASQ Quality Press
Intelligent Data Analysis, edited by Michael Berthold and David J. Hand, Springer-Verlag
Introduction to the Practice of Statistics (3rd ed.), by David S. Moore and George P. McCabe, W. H. Freeman
Linear Models in Statistics, by Alvin C. Rencher, Wiley
Models for Repeated Measurements (2nd ed.), by J. K. Lindsey, Oxford University Press
Modern Applied Statistics With S-PLUS (3rd ed.), by W. N. Venables and B. D. Ripley, Springer-Verlag
Regression Analysis by Example (3rd ed.), by Samprit Chatterjee, Ali S. Hadi, and Bertram Price, Wiley
Root Cause Analysis, by Bjorn Andersen and Tom Fagerhaug, ASQ Quality Press
Simulation: A Modeler's Approach, by James R. Thompson, Wiley
Six Sigma, by Mikel Harry and Richard Schroeder, Doubleday
Statistical Aspects of Health and the Environment, edited by Vic Barnett, Alfred Stein, and K. Feridun Turkman, Wiley
Statistical Methods for Quality Improvement (2nd ed.), by Thomas P. Ryan, Wiley
Statistical Modelling Using GENSTAT Registered Trademark, by K. J. McConway, M. C. Jones, and P. C. Taylor, Arnold/Oxford University Press
Statistical Process Analysis, by Layth C. Alwan, Irwin/McGraw-Hill
Statistical Process Control in Industry, by R. J. M. M. Does, K. C. B. Roes, and A. Trip, Kluwer
Statistical Process Monitoring and Optimization, edited by Sung H. Park and G. Geoffrey Vining, Marcel Dekker
Statistics and Experimental Design for Toxicologists (3rd ed.), by Shayne C. Gad, CRC Press
Subsampling, by Dimitris N. Politis, Joseph P. Romano, and Michael Wolf, Springer-Verlag
Title: | A Bayesian Approach to Combining Information From a Census, a Coverage Measurement Survey, and Demographic Analysis. |
Subject(s): | |
Source: | |
Author(s): | |
Abstract: | Demographic analysis of data on births, deaths, and migration and coverage measurement surveys that use capture-recapture methods have both been used to assess U.S. Census counts. These approaches have established that unadjusted Census counts are seriously flawed for groups such as young and middle-aged African-American men. There is considerable interest in methods that combine information from the Census, coverage measurement surveys, and demographic information to improve Census estimates of the population. This article describes a number of models that have been proposed to accomplish this synthesis when the demographic information is in the form of sex ratios stratified by age and race. A key difficulty is that methods for combining information require modeling assumptions that are difficult to assess based on fit to the data. We propose some general principles for aiding the choice among alternative models. We then pick a particular model based on these principles and imbed it within a more comprehensive Bayesian model for counts in poststrata of the population. Our Bayesian approach provides a principled solution to the existence of negative estimated counts in some subpopulations; provides for smoothing of estimates across poststrata, reducing the problem of isolated outlying adjustments; allows a test of whether negative cell counts are due to sampling variability or more egregious problems such as bias in Census or coverage measurement survey counts; and can be easily extended to provide estimates of precision that incorporate uncertainty in the estimates from demographic analysis and other sources. The model is applied to data for African-Americans age 30-49 from the 1990 Census, and results are compared with those from existing methods. [ABSTRACT FROM AUTHOR] |
AN: | 3167491 |
ISSN: | 0162-1459 |
Database: | Business Source Premier |
KEY WORDS: Gibbs sampling; Model selection; Postenumeration survey; Posterior predictive distribution.
Capture-recapture methods (Seber 1982) and demographic analysis (DA) of data on births, deaths, and migration have been used to estimate the undercount in the U.S. Census (Robinson 1993). These approaches have established that unadjusted Census counts are seriously flawed for groups such as young and middle-aged African-American men. Demographic analysis indicates a 1990 male-female ratio among African-Americans age 30-49 of .91, whereas the published 1990 Census counts indicate a ratio of .86; with imputations and erroneous enumerations removed, the 1990 Census identified only .78 males for every female in this age-race category (Bell et al. 1996). Considerable effort has been devoted during the past decade to developing methods that combine demographic information and capture-recapture analysis to improve Census estimates of the population (Bell 1993; Bell et al. 1996; Choi, Steel, and Skinner 1988; Das Gupta and Robinson 1990; Fay, Passel, and Robinson 1988; Isaki and Schultz 1986; Wolter 1990). Two key difficulties arise: DA typically provides estimates only for national levels of aggregation, and capture-recapture methods require modeling assumptions that are difficult to assess based on fit to the data. In the remainder of this section we review existing research on these problems. Section 2 suggests principles for choosing among capture-recapture models, and Section 3 imbeds these methods within a unified Bayesian model. Section 4 applies the results of Sections 2 and 3 to the U.S. Census data for African-Americans age 30-49.
The goal of the methods described here is to yield adjusted population counts in strata of the population, together with measures of uncertainty. Because the issue of combining information from macro data sources such as administrative records and from micro data sources such as surveys arises in many disparate fields (Raftery, Givens, and Zeh 1995), the methodology described here has potentially broader application than the U.S. Census.
1.1 Coverage Measurement Surveys
Since 1970, the Census has been supplemented by a coverage measurement survey (CMS), a detailed independent enumeration of households in a probability sample of Census blocks conducted immediately after the actual Census. In 1990, this CMS, termed the Post-Enumeration Survey (PES), was conducted in July-September 1990, following the April-June 1990 data collection for the Census. To combine the Census and CMS data, individuals in the Census who were imputed or for whom insufficient information existed to match with the CMS ("imputations") were removed from the Census total. In addition, persons in the Census listing who did not appear in the CMS were rechecked for "erroneous enumeration." Examples of erroneous enumerations included persons whose primary residence was a dormitory, persons enumerated at a vacation home that was not their primary residence, and any other persons not residing at the enumerated dwelling unit as of April 1. The Census count in the sampled blocks minus the imputations and an estimate of erroneous enumerations became the adjusted E sample. Similarly, persons in the CMS--the P sample--were cross-checked against Census records within the sample Census blocks and a set of surrounding blocks and assigned to either an In-Census or Out-Census category. After the counts in the sampled blocks were inflated by the inverse of their probability of selection, a 2 x 2 table (In-/Out-Census; In-/Out-CMS) was formed for males and females (S = M, F) in each of K poststrata (typically defined by geographic area and owner versus renter status) within age-by-race groupings (see Table 1). (For more information about the details of the P and E samples in the 1990 PES, see Hogan 1992.)
Table 1 also gives the "true" but unknown population counts in the kth poststratum for gender S that would be obtained if the CMS had itself been a census and if those missed in both the Census and CMS were known. These population quantities Psi[sub k] = {Psi[sup S, sub kij] : i = 0, 1; j = 0, 1; S = M, F}, where i is the Census enumeration status and j is the CMS enumeration status in the kth poststratum (1 if included, 0 otherwise), are considered unknown parameters and are estimated by the statistical methods to be described.
Table 2 displays the counts {y[sup S, sub k11], y[sup S, sub k01], z[sup S, sub k1.]} and associated estimated sampling errors from the 1990 Census for African-Americans age 30-49, stratified into 12 poststrata. Poststrata 1-6 include those residing in owner-occupied dwelling units in urban areas of 250,000 or more in the Northeast (1), South (2), Midwest (3), and West (4); owners in urban areas under 250,000 (5); and owners in nonurban areas (6). Poststrata 7-12 include those residing in nonowner (rental) dwelling units in the corresponding geographic areas. The analysis is conducted only on this one age-race "superstratum" because of time and length constraints. A more complete analysis will, of course, apply the methods to the entire U.S. population.
1.2 Model Constraints
A fundamental problem (Bell 1993) is that Table 1 provides only three data elements to estimate the four parameters. Thus constraints must be placed on the parameters to obtain unique parameter estimates. One such constraint is to assume independence of capture and recapture (ICR), that is, that the odds ratios for enumeration in the census and the CMS for sex S in poststratum k, Theta[sup S, sub k] = (Psi[sup S, sub k11]/Psi[sup S, sub k10])/(Psi[sup S, sub k01]/Psi[sup S, sub k00]), are all equal to 1 (Sekar and Deming 1949).
(1) ICR: Theta[sup S, sub k] = 1 for all k, S.
As Sekar and Deming point out, the ICR assumption can be violated either when the probabilities of capture and recapture are unequal or when they differ across individuals (unobserved heterogeneity), leading to "correlation bias." Correlation bias tends to lead to an underestimate of the undercount, because if it is due to unobserved heterogeneity, then Theta[sup S, sub k] > 1, leading to an underestimation of Psi[sup S, sub k00]. This form of bias is indicated by a pattern of implausibly low values of the male-female sex ratio, a value considered highly reliable by demographers (Robinson 1995), when the 1990 Census estimates are adjusted using the ICR model. For example, under the ICR assumption, the estimated male-female ratio among African-Americans age 30-49 is .84, substantially below the .91 DA estimate.
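For intuition, the classical dual-system (capture-recapture) calculation under the ICR assumption can be sketched as follows. This is the textbook Sekar-Deming estimate, not code from the article, and the function name is ours:

```python
def dual_system_estimate(n11, n10, n01):
    """Sekar-Deming dual-system estimate under ICR (odds ratio Theta = 1).

    n11 : counted in both the Census and the CMS
    n10 : counted in the Census only
    n01 : counted in the CMS only
    With Theta = (n11/n10)/(n01/n00) = 1, the doubly missed cell is
    n00 = n10 * n01 / n11, so the estimated total population is
    (n11 + n10) * (n11 + n01) / n11.
    """
    if n11 <= 0:
        raise ValueError("n11 must be positive")
    n00 = n10 * n01 / n11
    total = n11 + n10 + n01 + n00
    return n00, total
```

When correlation bias makes the true odds ratio exceed 1, this formula understates n00, and hence the undercount, exactly as described above.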
To correct for this bias, several models have been suggested that attribute the low observed male-female ratio to an undercount of males rather than an overcount of females. The estimated number of males in the population is increased to match the overall male-female ratio (Rho = Psi[sup M, sub ...]/Psi[sup F, sub ...]) from DA. The additional males are distributed over the poststrata using a method based on the assumptions of the model. In particular, Wolter (1990) assumed that the odds of appearing in the follow-up survey given that one was enumerated in the Census relative to the odds of appearing in the follow-up survey given that one was not enumerated in the Census are arbitrary for males and equal to 1 for females in a one-stratum design. We call this the fixed odds-ratio (FOR) model (here and elsewhere labels for the models are ours):
FOR: Theta[sup M, sub k] = Theta[sup M]
and
(2) Theta[sup F, sub k] = 1 for all k.
Bell (1993) extended Wolter's approach by considering multiple poststrata and alternative behavior models. The fixed relative risk (FRR) model assumes a constant relative risk for enumeration in the Census and CMS for males and independence for females:
FRR: Gamma[sup M, sub k] = (Psi[sup M, sub k11]/Psi[sup M, sub k1.])/(Psi[sup M, sub k01]/Psi[sup M, sub k0.]) = Gamma[sup M]
and
(3) Theta[sup F, sub k] = 1 for all k.
The fixed sex ratio (FSR) model assumes that the sex ratio in the out CMS-out Census cell is constant across strata:
FSR: Omega[sub k] = Psi[sup M, sub k00]/Psi[sup F, sub k00] = Omega
and
(4) Theta[sup F, sub k] = 1 for all k.
The generalized behavioral response model (GBR) assumes the probability of being in the coverage measurement survey given that one was not enumerated in the Census divided by the probability of being in the Census is constant across the poststrata for males:
GBR: Lambda[sup M, sub k] = (Psi[sup M, sub k01]/Psi[sup M, sub k0.])/(Psi[sup M, sub k1.]/Psi[sup M, sub k..]) = Lambda[sup M]
and
(5) Theta[sup F, sub k] = 1 for all k.
Das Gupta (Bell et al. 1996) proposed yet another model, which we term the fixed dual-coverage rate (FDCR) model. It assumes that the ratio of the Census-CMS combined coverage rate for males to the Census-CMS coverage rate for females is constant over all poststrata:
FDCR: Chi[sub k] = (1 - Psi[sup M, sub k00]/Psi[sup M, sub k..])/(1 - Psi[sup F, sub k00]/Psi[sup F, sub k..]) = Chi
and
(6) Theta[sup F, sub k] = 1 for all k.
All models adjust the fitted counts so that their sum across the poststrata matches the sex ratio Rho from DA. In practice, Rho is estimated from DA within an age-race group, so the models are applied separately to each age-race grouping.
Models (2)-(6) add a single parameter (Theta[sup M], Gamma[sup M], Omega, Lambda[sup M], or Chi) to the ICR model; all are "saturated" and provide an equally good fit to the data. Thus it is difficult to choose among alternative models, although they can yield adjustments with nontrivial differences. This is viewed as a major obstacle to combining the data from the CMS and DA (Bell 1993). Another problem is the existence of negative estimates of persons included in the Census but missed in the CMS, obtained by subtracting those estimated to have been in both the CMS and the Census from the Census total (Psi[sup S, sub k10] = z[sup S, sub k1.] - y[sup S, sub k11]). Bell (1993) determined the maximum likelihood estimates (MLEs) of Psi[sup S, sub k..] under the assumptions (1)-(5), with the constraint that Psi[sup S, sub k10] = max(0, z[sup S, sub k1.] - y[sup S, sub k11]), so that Psi[sup S, sub k11] = y[sup S, sub k11] and Psi[sup S, sub k01] = y[sup S, sub k01] if Psi[sup S, sub k10] > 0, and Psi[sup S, sub k11] = z[sup S, sub k1.] and Psi[sup S, sub k01] = y[sup S, sub k01]z[sup S, sub k1.]/y[sup S, sub k11] if Psi[sup S, sub k10] = 0. That is, strata with negative estimates of the In-Census-Out-CMS cell have their negative estimates set to 0 and their In-CMS column cells multiplied by z[sup S, sub k1.]/y[sup S, sub k11] to maintain the marginal Census counts and, as it turns out, the ICR-based poststratum estimates. Formal justification for this approach is lacking. In addition, it is unclear how to account for uncertainty in the DA sex ratios. This article builds on previous research in the following respects: Section 2 proposes principles by which one might choose among the saturated models (2)-(6), or others. Section 3 provides a unified Bayesian foundation for model fitting that
• addresses the negative cell problem in a principled manner
• provides for smoothing of estimates across poststrata, reducing the problem of isolated outlying adjustments
• allows a test of whether negative cell counts are due to sampling variability or more egregious problems such as bias in erroneous enumerations or imputations in Census counts or in assigning CMS subjects to In-/Out-Census categories
• includes a parameter to control variability of sex ratios across poststrata
• can be extended to provide estimates of precision that incorporate uncertainty in the DA sex ratio estimates.
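The negative-cell adjustment attributed to Bell (1993) above can be sketched for a single poststratum and sex. The function name and return convention are ours, and this is an illustration of the described constraint, not Bell's code:

```python
def adjust_negative_cell(z1, y11, y01):
    """Constrained treatment of a negative In-Census-Out-CMS cell.

    z1  : Census count z_k1. (imputations and erroneous enumerations removed)
    y11 : estimated In-Census, In-CMS count
    y01 : estimated Out-Census, In-CMS count
    Returns (psi11, psi10, psi01), the fitted counts for the three cells.
    """
    psi10 = z1 - y11                   # In-Census, Out-CMS cell
    if psi10 >= 0:
        return y11, psi10, y01         # no conflict with the Census margin
    # Negative cell: set it to 0 and rescale the In-CMS column by z1/y11
    # so that the marginal Census count z1 is maintained.
    scale = z1 / y11
    return y11 * scale, 0.0, y01 * scale
```

In the rescaled case the fitted In-Census total is y11 * (z1/y11) + 0 = z1, so the Census margin is preserved exactly as the text requires.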
Five alternative models (2)-(6) for combining CMS and DA data on sex ratios were proposed by Bell (1993) and Das Gupta (Bell et al. 1996), and many others might be envisaged. To simplify and reduce the scope of the model selection problem, we propose six principles for guiding the selection of a model for combining CMS and DA information:
1. Plausibility. The model should imply a plausible description of Census behavior.
• Assessment of this issue requires expert opinion and careful exposition of the model assumptions.
2. Fit. The model should minimize contradiction with available data.
• The "no adjustment" model (i.e., doing nothing) clearly fails this test, as it ignores the sizable body of evidence of differential undercount across demographic groups. A number of alternative models (including those considered by Bell and Das Gupta) provide better fits and hence should be preferred under this principle.
3. Prediction. The model should provide plausible predictions of key unobserved quantities; for example, undercount rates should be within limits deemed reasonable.
• Models may yield implausible outlying predictors for certain cells. Although a consistent pattern of unlikely predictions is evidence that the model is not appropriate, in isolated cases modifications that control the extent of adjustments might be considered. These may be achieved informally by ad hoc adjustments, or more formally by a Bayesian analysis based on prior distributions that limit the size of the adjustments. Section 3 includes an example of the Bayesian approach. Because the Bayesian approach is no panacea--if noninformative prior distributions give unreasonable model predictions, then constraining them via informative priors is only hiding the problem--we consider model fit via posterior predictive distributions in Section 4.2.
4. ICR inclusion. The model should include the ICR model, which assumes zero correlation bias within poststrata, as a possibility.
• It is harder to defend this principle as necessary on scientific grounds, but given the widespread adoption of the ICR model for CMS problems, it seems reasonable to restrict attention to the class of models that include that model for a particular choice of parameters. It is also in keeping with statistical parsimony; without evidence of correlation bias, we would accept the independence model.
5. Stability. If alternative competing models are not distinguished on the basis of 1-4, then models that yield more stable estimates of key estimands should be favored over models that yield less stable estimates.
• If little can be concluded about the relative biases of competing models, then a model that yields estimates with fewer potential outliers is to be preferred.
6. Conservatism. If alternative competing models are not distinguished on the basis of 1-5, then models that are more conservative with respect to undercount adjustment should be favored over models that are less conservative.
• The principle of "one adjusts to the minimum extent necessary to be consistent with the data" is pragmatic rather than scientific. Given the sentiment against any type of adjustment in some quarters, the goal of adjusting the Census counts to the minimal extent needed for consistency with DA and CMS data seems appropriate.
Of the six models discussed, only two--the FOR and FRR models--are both saturated and satisfy the ICR inclusion rule. Both models have simple interpretations and appear to satisfy both the plausibility and the prediction principles. Comparing the MLEs of the undercount rates under the various models, Bell (1993) determined that the FRR model yields somewhat more stable and conservative results than the FOR model. Hence we highlight this model in the remainder of the article.
This section outlines a comprehensive model for the underlying 8K population counts in the CMS tables that incorporates information about sex ratios from DA and eases prior specifications. For the FRR model (3), we reparameterize the eight population counts in poststratum k,
Psi[sub k] = {Psi[sup S, sub kij] : i = 0, 1; j = 0, 1; S = M, F} as
Psi[sup *, sub k] = (Psi[sub k..], Rho[sub k], Delta[sup M, sub k], Delta[sup F, sub k], Phi[sup M, sub k], Phi[sup F, sub k], Gamma[sup M, sub k], Gamma[sup F, sub k]),
where Psi[sub k..] is the total population count in poststratum k; Rho[sub k] = (Psi[sup M, sub k..])/(Psi[sup F, sub k..]), the sex ratio (Psi[sup M, sub k..] + Psi[sup F, sub k..] = Psi[sub k..]); Delta[sup S, sub k] = (Psi[sup S, sub k1.])/(Psi[sup S, sub k..]), the census undercount proportion for sex S; Phi[sup S, sub k] = (Psi[sup S, sub k11])/(Psi[sup S, sub k1.]), the proportion of census cases enumerated in the CMS for sex S; and Gamma[sup S, sub k] = (Psi[sup S, sub k11]/Psi[sup S, sub k1.])/(Psi[sup S, sub k01]/Psi[sup S, sub k0.]), the relative proportion of census and noncensus cases enumerated in the CMS for sex S.
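As a check on these definitions, a small sketch (hypothetical, not from the article) that maps the eight cell counts of one poststratum to this parameterization:

```python
def reparameterize(psi):
    """Map the eight cell counts of one poststratum to the FRR parameters.

    psi : dict {(S, i, j): count} with S in {'M', 'F'}, i = Census status,
          j = CMS status (1 = included, 0 = missed).
    Returns (total, rho, delta, phi, gamma) following the text's definitions.
    """
    tot = {S: sum(psi[(S, i, j)] for i in (0, 1) for j in (0, 1)) for S in 'MF'}
    total = tot['M'] + tot['F']                       # Psi_k..
    rho = tot['M'] / tot['F']                         # sex ratio Rho_k
    delta, phi, gamma = {}, {}, {}
    for S in 'MF':
        in_census = psi[(S, 1, 1)] + psi[(S, 1, 0)]   # Psi^S_k1.
        out_census = psi[(S, 0, 1)] + psi[(S, 0, 0)]  # Psi^S_k0.
        delta[S] = in_census / tot[S]                 # Delta^S_k
        phi[S] = psi[(S, 1, 1)] / in_census           # Phi^S_k
        # Gamma^S_k: CMS-enumeration proportion of Census cases relative
        # to that of non-Census cases
        gamma[S] = phi[S] / (psi[(S, 0, 1)] / out_census)
    return total, rho, delta, phi, gamma
```

Under the ICR assumption this gamma would equal 1 for both sexes; the FRR model instead forces gamma['M'] to a common unknown constant across poststrata.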
The foregoing parameterization is particularly useful for the FRR model; other choices of parameterizations are more natural for other models. For example, under the FOR assumption (2), we would replace the Gamma[sup S, sub k] with Theta[sup S, sub k] = (Psi[sup S, sub k11]/Psi[sup S, sub k10])/(Psi[sup S, sub k01]/Psi[sup S, sub k00]), whereas under the ICR assumption (1), we would force Gamma[sup S, sub k] = Theta[sup S, sub k] = 1 for all S and k. Priors can be chosen that avoid negative cell count estimates, borrow strength across poststrata to reduce outlying predictions, and moderate extreme undercount adjustments; for example, a proper prior for Gamma[sup S, sub k] can prevent the risk ratio of enumeration from reaching unreasonable extremes. In our analysis of the FRR model, we select the following independent priors for each parameter:
• p(Psi[sub k..]) proportional to 1, a flat prior corresponding to our lack of knowledge about the total population counts in each poststratum.
• Rho[sub k] is similar to N(Rho, Sigma[sup 2]) subject to the constraint that Sigma[sub k]w[sub k]Rho[sub k] = Rho, where w[sub k] = (Psi[sub k..])/(Sigma[sub k]Psi[sub k..]) and Rho is the DA-estimated nationwide sex ratio. Variation in the sex ratios across poststrata is modeled via the parameter Sigma[sup 2]. The normal distribution is chosen for computational convenience.
• Delta[sup S, sub k] is similar to BETA(a[sup S], b[sup S]) and Phi[sup S, sub k] is similar to BETA(c[sup S], d[sup S]), beta priors that smooth the proportion of Census undercounts and the proportion of Census cases enumerated in the CMS across poststrata independently for each sex and provide a support of [0, 1] for these proportions.
• Gamma[sup M, sub k] = Gamma[sup M] is similar to GAMMA(Alpha, Beta) for all k; Gamma[sup F, sub k] = 1 for all k. These priors assume that the relative proportion of Census and non-Census cases enumerated in the CMS for each poststratum is a constant across poststrata (likely greater than 1) for all males and is constant and known to be equal to 1 (under the independence assumption) for females.
In addition, we assume that
y[sup S, sub k11] | Psi[sup S, sub k11] ~ N(Psi[sup S, sub k11], (Tau[sup S, sub k11])[sup 2]),
y[sup S, sub k01] | Psi[sup S, sub k01] ~ N(Psi[sup S, sub k01], (Tau[sup S, sub k01])[sup 2]),
and
z[sup S, sub k1.] | Psi[sup S, sub k1.] ~ N(Psi[sup S, sub k1.], (u[sup S, sub k1.])[sup 2]),
where y[sup S, sub k11], y[sup S, sub k01], and z[sup S, sub k1.] are all independent. The marginal covariance between y[sup S, sub k11] and y[sup S, sub k01] is expected to be very small unless the sampling fraction is quite large (see the proof at http://www-personal.umich.edu/~mrelliot/census/appendix.ps). The marginal covariances between y[sup S, sub k11] and z[sup S, sub k1.] and between y[sup S, sub k01] and z[sup S, sub k1.] are more problematic: because the erroneous enumerations subtracted from the Census counts and the Census status of the persons sampled in the CMS are estimated from the same sample, there may be nontrivial correlations regardless of the sample size. Because information on these covariances was not available, we assume independence as a simplification.
3.1 Estimating the Posterior Distribution
The mode of the posterior distribution of {Psi[sup *, sub k]} might be computed by a numerical optimization algorithm. However, given the large number of parameters relative to the data and the presence of peaks near the boundary of support for Phi[sup S, sub k] in cells where z[sup S, sub k1.] < y[sup S, sub k11], the Newton-Raphson and Fisher scoring algorithms performed poorly. Hence we used Gibbs sampling (Gelfand and Smith 1990; Gelman and Rubin 1992; Smith and Roberts 1993) to draw estimates of the population parameters from their joint posterior distribution. In brief, Gibbs sampling obtains draws from a joint distribution p(Theta | y) for Theta = {Theta[sub 1],..., Theta[sub n]} by initializing Theta at some reasonable Theta[sup (0)] and drawing Theta[sup (1), sub 1] ~ p(Theta[sub 1] | Theta[sup (0), sub 2],..., Theta[sup (0), sub n], y), Theta[sup (1), sub 2] ~ p(Theta[sub 2] | Theta[sup (1), sub 1], Theta[sup (0), sub 3],..., Theta[sup (0), sub n], y), and so forth. As t approaches Infinity, Theta[sup (t)] converges in distribution to p(Theta[sub 1],..., Theta[sub n] | y). The Gibbs sampling approach estimates the entire posterior distribution and thus allows for greater flexibility in point estimation and inference.
The conditional distributions can be summarized as follows:
• Psi[sub k..] | Rest ~ N(Eta[sup Psi, sub k], (Xi[sup Psi, sub k])[sup 2]).
• Draw M[sub k] = Rho[sub k]/(1 + Rho[sub k]) (the fraction of males): M[sub k] | Rest ~ N(Eta[sup M, sub k], (Xi[sup M, sub k])[sup 2]). Draws are made from N[sub K](Eta[sup M], Xi[sup M]) conditional on M = Rho/(1 + Rho), where Eta[sup M] = (Eta[sup M, sub 1],..., Eta[sup M, sub K])' and Xi[sup M] = diag((Xi[sup M, sub 1])[sup 2],..., (Xi[sup M, sub K])[sup 2]), via the SWEEP operator (Goodnight 1979; Little and Rubin 1987).
• Delta[sup S, sub k] | Rest proportional to (Delta[sup S, sub k])[sup a[sup S] - 1](1 - Delta[sup S, sub k])[sup b[sup S] - 1]exp(Eta[sup Delta, sub k]).
• Phi[sup S, sub k] | Rest proportional to (Phi[sup S, sub k])[sup c[sup S] - 1](1 - Phi[sup S, sub k])[sup d[sup S] - 1]exp(Eta[sup Phi, sub k]).
• Gamma[sup M] | Rest proportional to (Gamma[sup M])[sup Alpha - 1]exp{-(Gamma[sup M]/Beta) + Sigma[sub k]Eta[sup Gamma, sub k]}.
Here Eta[sup Theta, sub k] and Xi[sup Theta, sub k] are linear functions of Psi[sup *, sub k] and data.
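The constrained draw of the M[sub k] can be sketched without the SWEEP operator: for independent normals subject to a linear constraint, sampling unconstrained and then applying the Gaussian conditioning correction yields an exact draw from the constrained distribution. All numerical values below are invented toy inputs, not the article's Eta[sup M] or Xi[sup M].

```python
import random

def draw_constrained_normals(eta, xi2, w, c, seed=2):
    """Draw independent M_k ~ N(eta_k, xi2_k), conditioned on the linear
    constraint sum_k w_k * M_k = c.

    A sketch of a standard construction (not the article's SWEEP-based
    code): sample unconstrained, then apply the Gaussian conditioning
    correction M <- M + (Sigma w)(c - w'M)/(w' Sigma w), which is an
    exact draw from the constrained conditional distribution.
    """
    random.seed(seed)
    m = [random.gauss(e, v ** 0.5) for e, v in zip(eta, xi2)]
    denom = sum(wk * wk * vk for wk, vk in zip(w, xi2))     # w' Sigma w
    resid = c - sum(wk * mk for wk, mk in zip(w, m))        # c - w'M
    return [mk + vk * wk * resid / denom for mk, wk, vk in zip(m, w, xi2)]

# Toy values standing in for Eta^M, (Xi^M)^2, the weights, and the target.
m = draw_constrained_normals(eta=[0.45, 0.48, 0.47],
                             xi2=[0.010, 0.020, 0.015],
                             w=[0.2, 0.5, 0.3], c=0.475)
```

The returned vector satisfies the weighted-sum constraint exactly, while each coordinate remains a draw from its conditional normal law.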
Draws from the nonstandard distributions were made via the inverse-cdf method, determining each conditional pdf up to a constant. The constant of proportionality was obtained by numerical integration, using 32-point Gauss-Hermite quadrature on each of 1,000 subintervals of the support (or of the line segment on which the posterior is observed to be nontrivially greater than 0). (For a complete description of the conditional distributions, see http://www-personal.umich.edu/~mrelliot/census/appendix.ps.)
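A simplified version of this inverse-cdf scheme, with plain midpoint integration on each grid cell standing in for the article's 32-point Gauss-Hermite rule:

```python
import bisect
import math
import random

def make_inverse_cdf_sampler(log_pdf, lo, hi, n_cells=1000, seed=3):
    """Sampler for a univariate density known only up to a constant.

    Tabulate the unnormalized pdf over the region of nontrivial mass,
    accumulate a normalized cdf numerically, and invert it at a
    uniform draw. (Midpoint integration on each of n_cells cells
    stands in for the Gauss-Hermite rule used in the article.)
    """
    random.seed(seed)
    width = (hi - lo) / n_cells
    mids = [lo + (i + 0.5) * width for i in range(n_cells)]
    mass = [math.exp(log_pdf(x)) for x in mids]   # unnormalized cell masses
    total = sum(mass)
    cdf, acc = [], 0.0
    for cell in mass:
        acc += cell
        cdf.append(acc / total)

    def draw():
        i = bisect.bisect_left(cdf, random.random())
        return mids[min(i, n_cells - 1)]
    return draw

# Example: an unnormalized Beta(3, 5)-shaped density on (0, 1).
draw = make_inverse_cdf_sampler(lambda x: 2 * math.log(x) + 4 * math.log(1 - x),
                                1e-9, 1 - 1e-9)
mean = sum(draw() for _ in range(20000)) / 20000   # Beta(3, 5) has mean 3/8
```

Because the normalizing constant cancels in the cdf ratio, the sampler never needs it explicitly; the grid only has to cover the region where the density is nontrivially positive.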
3.2 Estimating the Hyperparameters
The variances (Tau[sup S, sub k11])[sup 2], (Tau[sup S, sub k01])[sup 2], and (u[sup S, sub k1.])[sup 2] were treated as known. The beta hyperparameters a[sup S], b[sup S], c[sup S], and d[sup S] were estimated by Gibbs sampling under a uniform hyperprior distribution. Because little information is available to estimate the gamma hyperparameters Alpha and Beta, we chose the "flattest" prior for which P(Gamma[sup M] in [.5, 2.0]) = .95. Specifically, Alpha and Beta were chosen from a grid of values to maximize the prior variance Alpha Beta[sup 2] under the constraint that P(.5 < X < 2.0) = .95 for X ~ Gamma(Alpha, Beta) (Casella and Berger 1990, pp. 186-187); this yielded Alpha = 9 and Beta = .1306. Rather than attempting to find an empirical Bayes estimate of Sigma[sup 2], we set Sigma = 1, which essentially allows the poststratum sex ratios to vary freely, subject to the constraint that they yield the DA estimate when aggregated.
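The grid search for the "flattest" gamma prior can be sketched as follows. The search below is restricted to integer shapes Alpha (so a closed-form gamma cdf suffices) and finds, for each shape, the variance-maximizing scale Beta meeting the coverage constraint by bisection; the article's actual grid is not specified, so this sketch need not reproduce Alpha = 9 and Beta = .1306 exactly.

```python
import math

def gamma_cdf_int_shape(x, alpha, beta):
    """CDF of Gamma(shape alpha, scale beta) for integer alpha, via the
    Poisson-sum identity P(X <= x) = 1 - sum_{k < alpha} e^-r r^k / k!,
    with r = x / beta."""
    r = x / beta
    return 1.0 - sum(math.exp(-r + k * math.log(r) - math.lgamma(k + 1))
                     for k in range(alpha))

def coverage(alpha, beta, lo=0.5, hi=2.0):
    return gamma_cdf_int_shape(hi, alpha, beta) - gamma_cdf_int_shape(lo, alpha, beta)

def flattest_beta(alpha, lo=0.5, hi=2.0, target=0.95):
    """Largest scale beta with P(lo < X < hi) = target for this shape
    (larger beta means larger prior variance alpha * beta^2): locate the
    coverage peak on a coarse grid, then bisect on the decreasing branch.
    Returns None when no beta attains the target coverage."""
    grid = [i / 100.0 for i in range(1, 301)]
    b_peak = max(grid, key=lambda b: coverage(alpha, b, lo, hi))
    if coverage(alpha, b_peak, lo, hi) < target:
        return None
    b_lo, b_hi = b_peak, 5.0
    for _ in range(60):
        mid = 0.5 * (b_lo + b_hi)
        if coverage(alpha, mid, lo, hi) >= target:
            b_lo = mid
        else:
            b_hi = mid
    return b_lo

# Search integer shapes for the maximum-variance ("flattest") prior.
best = max(((a, b) for a in range(2, 21) if (b := flattest_beta(a))),
           key=lambda ab: ab[0] * ab[1] ** 2)
```

As a sanity check, the article's reported choice sits essentially on the constraint: the coverage of [.5, 2.0] under Gamma(9, .1306) is approximately .95.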
4.1 Population and Undercount Estimation
We now apply the methods described earlier to the 1990 Census data in Table 2. The poststratified results under the ICR model are given in Table 3 (reproduced from table B-1 of Bell et al. 1996). Table 4 gives the MLEs under the FRR model using Bell's (1993) approach described in Section 1.2.
Five chains of the Gibbs sampler, each containing 1,000 draws, were run from different starting points, with the first 200 draws discarded as an initial "burn-in." The estimated ratio R of the marginal posterior variance across all sequences to the mean within-sequence posterior variance was used to check the degree of convergence of the parameters (Gelman, Carlin, Stern, and Rubin 1995, pp. 331-332). Examination of all parameters estimated under our model for the 24 African-American age 30-49 strata shows that 1.00 < square root of R < 1.07, well below the value square root of R approximately equal to 1.2 that Gelman et al. indicated as acceptable convergence. (A more detailed description of methods to assess convergence of the Gibbs chain, together with a description of how initial parameter estimates were obtained, is available at http://www-personal.umich.edu/~mrelliot/census/appendix.ps.)
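The convergence statistic can be computed directly from the retained draws. A minimal sketch of the potential scale reduction factor, applied here to synthetic chains rather than the article's Gibbs output:

```python
import random
import statistics

def sqrt_r_hat(chains):
    """Square root of Gelman and Rubin's (1992) potential scale reduction
    factor: the pooled marginal posterior variance estimate divided by
    the mean within-chain variance. Values near 1 suggest convergence."""
    n = len(chains[0])                        # retained draws per chain
    means = [statistics.fmean(c) for c in chains]
    w = statistics.fmean(statistics.variance(c) for c in chains)  # within
    b_over_n = statistics.variance(means)     # between-chain variance of means, B/n
    var_plus = (n - 1) / n * w + b_over_n     # pooled posterior variance estimate
    return (var_plus / w) ** 0.5

# Five chains of 800 retained draws from the same stationary distribution,
# standing in for converged Gibbs output.
random.seed(4)
chains = [[random.gauss(0.0, 1.0) for _ in range(800)] for _ in range(5)]
r = sqrt_r_hat(chains)
```

For chains that share a stationary distribution, as here, the statistic falls very close to 1; chains stuck in different regions inflate the between-chain term and push it well above 1.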
The posterior mean of Psi[sup M, sub k11] is simulated as the average of [Rho[sub k]/(1 + Rho[sub k])](Delta[sup S, sub k])[sup (t[sub j])](Phi[sup S, sub k])[sup (t[sub j])]Psi[sup (t[sub j]), sub k..], where the superscript (t[sub j]) denotes the tth cycle of the jth chain of the Gibbs algorithm. Similar transformations provide the posterior means of Psi[sub k10], Psi[sub k01], and Psi[sub k00]. The resulting estimates are given in Table 5. Use of a beta prior for the ratio of the In-Census-In-CMS count to the total Census count has eliminated negative estimates in the In-Census-Out-CMS and Out-Census-Out-CMS cells; more generally, the assumption of a common prior distribution has reduced the larger positive values and increased the smaller positive values of this cell somewhat. The generally much larger estimates of the Out-Census-Out-CMS cell among men compared with the ICR model are a consequence of the correlation bias estimated from the discrepancy between the ICR-model-estimated and DA-estimated sex ratios. Note also that, in strata with negative cells, the In-Census-In-CMS cells are substantially reduced relative to their unadjusted values to compensate for the increase in the In-Census-Out-CMS cells; this is a result of the total Census counts having relatively smaller estimated variability than the CMS counts, thus forcing the In-CMS-In-Census cells to "surrender" people to the Out-CMS-In-Census cells. In-Census-In-CMS counts increase somewhat in strata where all counts are positive to balance this loss.
Figure 1 and Table 7 indicate the differences between the population estimates in each poststratum for African-Americans age 30-49 using (a) Census estimates (minus imputations and estimates of erroneous enumerations), (b) maximum likelihood estimates for the ICR model, (c) maximum likelihood estimates for the FRR model adjusted to DA sex ratios using Bell's (1993) approach, and (d) posterior means under the model of Section 3. Note that estimates for (b) and (c) are identical for females. Several key observations can be derived from Figure 1:
• The undercount appears to be greatest, as might be expected, in poststrata 7-10 (renters residing in urban areas of 250,000+).
• The FRR models provide larger estimates than the ICR models, because these models adjust for correlation bias by forcing the total male/female ratio to equal or approximate the DA sex ratio estimates. (Recall that females are assumed to have zero correlation bias.) This bias appears to be associated with the undercount itself, being larger in the rental poststrata than in the owner poststrata.
• The estimates derived for males from the Bayesian approach of Section 3 under the FRR assumption generally fall between the MLEs of Bell (1993) for the FRR model and the ICR MLEs. A discrepancy between the MLE and Bayesian FRR models appears in poststratum 10 for men (nonowners in Western urban areas). This can be explained in part by the fact that this poststratum (a) has the largest proportion of males as estimated by the post-CMS-adjusted data (SR = 1.030 under the ICR MLE and SR = 1.170 under the FRR MLE), and (b) has apparently poor Census coverage as estimated by the CMS. This poststratum contained the smallest proportion of the CMS that were identified in the Census and the third-smallest proportion of those in the Census who were estimated to have been in the CMS. However, this estimate of poor coverage is based on relatively unstable CMS estimates (the largest coefficient of variation (CV) for y[sup M, sub k01] and the third-largest CV for y[sup M, sub k11]). Thus the Bayesian approach identifies and "corrects" to some degree this potential outlier, increasing its estimated coverage toward the all-strata mean. Similar discrepancies in strata 6 and 7 result in part from large CVs of the CMS estimates, which allow the posterior results to be pulled toward the ICR estimates. The larger Bayes FRR estimate in stratum 4 is a result of removing the large negative cell in the "raw" Out-CMS-In-Census cell and a similarly large CV in the CMS estimate.
• Female estimates under the Bayes FRR models are somewhat smaller on average than under the MLE FRR model, possibly a consequence of the smoothing of the sex ratios. Exceptions are strata 4 and 11, where the removal of the large negative cells increases the estimate of the female population over the MLE estimates.
Compared against the published Census estimate (U.S. Census Bureau 1991) for African-Americans age 30-49, the total undercount for African-American women age 30-49 is 1.7% under both the ICR and MLE FRR models and .5% under the Bayesian FRR model. The undercount for African-American men age 30-49 is estimated to be -1.1% under the ICR model, 6.7% under the FRR model using Bell's (1993) maximum likelihood approach, and 5.4% under the Bayes FRR model. The total undercount is .4% for the ICR model, 4.1% for the FRR MLE, and 2.9% for the Bayes FRR model.
4.2 Model Fit: Sex Ratios
Das Gupta (in Bell et al. 1996) argued that the approach of Bell (1993) yields undue variation in estimated sex ratios (SRs) between strata. (SRs under Bell's FRR model range between .822 and 1.170; SRs under Das Gupta's own model vary between .789 and 1.050.) We introduce a prior for the poststratum sex ratios that assumes Rho[sub k] ~ N(Rho, Sigma[sup 2]), where Rho = .9063, the demographic estimate for this age-race group. This allows a degree of control over the variability of the sex ratios, ranging from forcing all sex ratios to equal the estimated nationwide sex ratio (Sigma[sup 2] = 0) to estimating the poststratum sex ratios without restriction (Sigma[sup 2] = Infinity). Table 6 gives the male-to-female sex ratios for this model with Sigma[sup 2] = 1, together with the adjusted census estimates, the ICR model, and the FRR and FDCR MLEs. The choice of Sigma[sup 2] = 1 allows the data from the likelihood to overwhelm the prior and thus essentially estimates the sex ratios only under the constraint that Sigma[sub k]w[sub k]Rho[sub k] = Rho, where w[sub k] = (Psi[sub k..])/(Sigma[sub k]Psi[sub k..]). Even when the sex ratio is essentially unconstrained by the prior variance, the variability across the poststratum estimates is less than for either the FRR or FDCR MLEs.
However, the presence of sex ratios greater than 1 in poststrata 4 and 10 seems unlikely and suggests that the FRR model with this prior may be yielding overly inflated estimates of sex ratio variability. Thus we considered a prior standard deviation for Rho[sub k] of Sigma = .033, which reduced P(Rho[sub k] > 1) to less than .002. The results are given in the far right column of Table 6. This moderately informative prior has greatly dampened the sex ratio variation, although owner strata still have slightly higher proportions of males, as would be expected. The extent to which Sigma[sup 2] should be reduced to dampen the perhaps overly variable within-stratum sex ratio estimates is a matter that additional demographic expertise should address.
4.3 Model Fit: Posterior Predictive Distributions
A key assumption in the foregoing model is that negative differences z[sup S, sub k1.] - y[sup S, sub k11] are attributable to sampling error alone, and are not caused by systematic differences between the Census and CMS methods of enumeration. Overstatement of the number of matches from the CMS will bias upward the estimated In-Census-In-CMS count y[sup S, sub k11], and overstatement of the rate of erroneous enumerations from the Census will bias downward the Census count z[sup S, sub k1.]. This would clearly be the case if (Tau[sup S, sub k11])[sup 2] and (u[sup S, sub k1.])[sup 2] were both near 0--that is, if we assumed that y[sup S, sub k11] and z[sup S, sub k1.] were essentially exact estimates of the true population values Psi[sup S, sub k11] and Psi[sup S, sub k1.]. Hence the negative Out-CMS-In-Census counts potentially provide evidence of lack of fit of the data to the model. To examine this possibility, we utilize posterior predictive distributions (PPDs) (Gelman et al. 1995; Gelman, Meng, and Stern 1996).
Classic p values represent the probability under the model (typically at some fixed parameter value Theta) that the observed statistic T(y) will be less than (or greater than) the value of the statistic that would be seen in repeated observations: P(T(y) <= T(y[sup rep]) | Theta). The PPD p value represents the probability that the observed statistic (which can be a function of both the data y and the parameter Theta) is more extreme than the replicated statistic, conditional on the observed data: P(T(y, Theta) <= T(y[sup rep], Theta) | y). PPD p values can be obtained from the draws of Theta generated by the Gibbs sampler: y[sup rep] is drawn from f(y | Theta[sup rep]), and T(y[sup rep], Theta[sup rep]) is compared with T(y, Theta[sup rep]). Specifically, we draw (y[sup S, sub k11])[sup rep] from N((Psi[sup S, sub k11])[sup rep], (Tau[sup S, sub k11])[sup 2]), where (Psi[sup S, sub k11])[sup rep] is constructed from the draws of Rho[sub k], Delta[sup S, sub k], Phi[sup S, sub k], and Psi[sub k..] used to estimate E(Psi[sup S, sub k11] | y[sup S, sub k11], y[sup S, sub k01], z[sup S, sub k1.]) earlier. Similarly, we draw (z[sup S, sub k1.])[sup rep] from N((Psi[sup S, sub k1.])[sup rep], (u[sup S, sub k1.])[sup 2]). We then compare (z[sup S, sub k1.])[sup rep] - (y[sup S, sub k11])[sup rep] with z[sup S, sub k1.] - y[sup S, sub k11]. Examining the histograms (not shown) shows that the only strata for which the observed data appear in the tail of the predictive distributions are poststratum 4 for both males and females: the PPD p value is .070 for males and .054 for females. Thus the large negative Out-CMS-In-Census cells are not necessarily evidence of bias in the estimation of CMS capture or of erroneous enumeration and imputation.
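A sketch of this PPD p value computation for the statistic T = z - y[sub 11], using invented posterior draws and measurement variances in place of the article's:

```python
import random

def ppd_p_value(post_draws, tau, u, y11_obs, z_obs, seed=5):
    """Posterior predictive p value for T = z - y11.

    post_draws holds draws of (Psi_k11, Psi_k1.). For each draw we
    simulate replicated data from the measurement model and record how
    often the replicated difference is at least as small as the
    observed one.
    """
    random.seed(seed)
    t_obs = z_obs - y11_obs
    hits = 0
    for psi11, psi1dot in post_draws:
        y_rep = random.gauss(psi11, tau)    # (y_k11)^rep ~ N(Psi_k11, tau^2)
        z_rep = random.gauss(psi1dot, u)    # (z_k1.)^rep ~ N(Psi_k1., u^2)
        if z_rep - y_rep <= t_obs:
            hits += 1
    return hits / len(post_draws)

# Invented posterior draws for a "negative cell" stratum (in thousands):
# Psi_k11 near 180 but Psi_k1. near 160, so z - y11 is negative on average.
random.seed(6)
draws = [(random.gauss(180.0, 10.0), random.gauss(160.0, 5.0))
         for _ in range(4000)]
p = ppd_p_value(draws, tau=30.0, u=4.0, y11_obs=178.0, z_obs=125.0)
```

A p value that is not extreme, as here, indicates that sampling error under the measurement model can plausibly account for the observed negative difference.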
Another PPD for diagnosing model fit is given by Figure 2, which plots the percentage of men in the draws of the total In-Census counts against the total In-Census count [z[sup M, sub k1.]/(z[sup M, sub k1.] + z[sup F, sub k1.]) versus (z[sup M, sub k1.] + z[sup F, sub k1.])]. Figure 2 highlights two features of model fit. First, small estimates of the "true" In-Census count (Psi[sup M, sub k1.] + Psi[sup F, sub k1.]) are associated with higher percentages of men. This is because small In-Census estimates are associated with smaller estimates of follow-up enumeration (Psi[sup S, sub k11]/Psi[sup S, sub k.1]), which in turn are associated with a greater degree of correlation bias, which leads to larger estimates of the male population relative to the female population. Second, all of the observed In-Census sex ratios are well within the model predictions, although they tend to lie in the lower range of the predictions.
4.4 Inference About Populations
Inferences about the posterior distribution of parameters of interest can be obtained easily from the distribution of the Gibbs draws. For example, an estimate of the 95% posterior probability interval can be obtained by noting the 100th and 3,900th smallest of the 4,000 draws from the posterior distribution. Table 7 gives the means and 95% posterior probability intervals (PPIs) for the total population in each of the 12 poststrata for African-Americans age 30-49 under the Bayes FRR model, together with the adjusted Census estimates and Bell's (1993) MLEs under the ICR and FRR models. The 95% PPI for the undercount is (-1.8% to 2.6%) for females, (3.2% to 7.5%) for males, and (.6% to 5.2%) overall.
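The order-statistic rule for the interval can be written directly:

```python
def posterior_interval(draws, level=0.95):
    """Equal-tailed posterior probability interval from simulation draws:
    with 4,000 draws and level .95 this returns the 100th and 3,900th
    smallest values, matching the rule quoted in the text."""
    s = sorted(draws)
    n = len(s)
    k = int(n * (1.0 - level) / 2.0)   # 100 when n = 4000 and level = .95
    return s[k - 1], s[n - k - 1]      # k-th and (n - k)-th smallest values

# Stand-in draws: the integers 1..4000 rather than real Gibbs output.
interval = posterior_interval(list(range(1, 4001)))
```

On the stand-in draws 1 through 4,000 this returns (100, 3900), the 100th and 3,900th smallest values.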
Figure 3 plots draws of the total population against the proportion of the population that is male, and shows how easily other posterior predictive intervals, including multidimensional intervals, can be generated. Here we see that larger estimates of total population correspond with larger estimates of correlation bias and hence larger estimates of the proportion of the population that is male. MLEs of the equivalent results for the ICR and FRR models are also presented, along with the adjusted census counts.
In this article we have summarized methods proposed for incorporating post-Census CMS and demographic data into estimates of Census subpopulation counts. All methods face the difficulty that the underlying cell counts in the 2 x 2 Census-CMS poststratification tables are unidentifiable unless a model is posited for the population. The simplifying assumption of independence--that the probabilities of capture and recapture are independent and homogeneous across the population--leads to ratios of males to females that are typically lower than estimates from demographic analysis. Numerous plausible models can be suggested that incorporate these sex ratio data, all providing perfect fits to the data. We have suggested six principles--plausibility, fit, prediction, independence model inclusion, stability, and conservatism--to help choose among the models. (These principles are generally applicable to choosing among any statistical models whenever the fit principle alone is deemed insufficient.) Use of these principles suggests the fixed relative risk of enumeration in the CMS and Census (FRR) model [see (3)] as a leading candidate for selection.
Beyond these qualitative discussions, we have described a more comprehensive statistical model that, through judicious choice of parameterization and prior distributions, eliminates negative cell estimates from the In-Census-Out-CMS cell of the poststratification tables and reduces outlying predictions of undercount rates. Applying this approach to the FRR model using 1990 Census data for African-Americans age 30-49 yielded estimates of undercount of .5% for women, 5.4% for men, and 2.9% overall in this race-age category, with corresponding 95% posterior undercount intervals of (-1.8% to 2.6%), (3.2% to 7.5%), and (.6% to 5.2%). The estimated 1990 Census undercounts among African-Americans age 30-49 using the FRR MLEs (Bell 1993) are 1.7% for women, 6.7% for men, and 4.1% overall; no confidence intervals are easily available. Our approach identifies potential outliers in the poststrata tables and reduces their impact on post-CMS total population estimates. In addition, the model detects, through posterior predictive distributions, strata in which large negative raw In-Census-Out-CMS cells may be due to bias in, rather than variance in, CMS enumeration or Census estimates. Finally, our model allows direct control over the interstrata variation in the sex ratios Rho[sub k] through the variance parameter Sigma[sup 2]. To be conservative, we have utilized Sigma[sup 2] = 1, which essentially flattens the prior distribution relative to the likelihood. As with the MLEs, some of the resulting within-poststratum sex ratio estimates yielded implausibly large proportions of males. If prior estimates of Sigma are available (they were not available to us), they might be utilized, in the process reducing poststratum sex ratio variability to more plausible values. It should also be noted that our results are based on simulations drawn from the estimated posterior distribution of the poststrata populations, and hence are subject to simulation error.
Many extensions of the methods and models described could be envisaged. One immediate extension would be to introduce a prior for Rho to account for known uncertainty in the DA-estimated nationwide sex ratio. Also, the estimates of variability in the CMS and Census data are treated as fixed; prior distributions could be assumed to estimate any uncertainty in their values. Additional demographic measures could be incorporated in our model through careful choice of parameterizations. Prior means for the poststrata cell data or sex ratios could be regressed on poststratum characteristics to further reduce the dimensionality of the model. In the future, covariates more closely associated with the probability of response (Slud 1998), such as number and type of contact attempts required, could be used to form the poststrata, reducing correlation bias.
Additionally, the sensitivity of the estimates under the various model assumptions (2)-(6) could be examined, as Bell (1993) did in the context of maximum likelihood estimation. A more complete extension would average the results across the various models by noting that, for J models M[sub j],

P(Psi | data) = Sigma[sub j = 1,...,J] P(Psi | data, M[sub j])P(M[sub j] | data), where P(M[sub j] | data) proportional to P(data | M[sub j])P(M[sub j]).
Thus if all models are considered equally likely a priori, P(M[sub 1]) = ... = P(M[sub J]) = 1/J, and if all are saturated so that P(M[sub 1] | data) = ... = P(M[sub J] | data), then the marginal posterior expectation and variance of, say, the total population could be obtained as

(7) E(Psi | data) = (1/J) Sigma[sub j] E(Psi | data, M[sub j])

and

V(Psi | data) = (1/J) Sigma[sub j] V(Psi | data, M[sub j]) + (1/J) Sigma[sub j] [E(Psi | data, M[sub j]) - E(Psi | data)][sup 2].
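Under equal weights, (7) and the companion variance formula are the standard mixture-moment identities, which can be computed directly from per-model posterior summaries (the numbers below are illustrative, not article results):

```python
import statistics

def model_average(means, variances):
    """Marginal posterior mean and variance under equal model weights 1/J:
    the grand mean, and the mean within-model variance plus the spread
    of the model-specific means (the standard mixture-moment identities)."""
    j = len(means)
    grand_mean = statistics.fmean(means)
    within = statistics.fmean(variances)                       # (1/J) sum V_j
    between = sum((m - grand_mean) ** 2 for m in means) / j    # mean spread
    return grand_mean, within + between

# Illustrative posterior summaries of a total under three hypothetical models.
mean, var = model_average([4136.0, 4081.0, 4105.0], [900.0, 1600.0, 1200.0])
```

The between-model term means the averaged variance always exceeds the mean within-model variance whenever the models disagree, which is precisely the extra model uncertainty the averaging is meant to capture.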
One criticism of our proposed approach is that it is explicitly Bayesian, and hence incorporates subjective elements through the choices of model and prior. However, every method of modern Census enumeration aimed at getting counts of the full population requires subjective assumptions--including methods that leave raw Census counts unadjusted. The Bayesian framework makes these assumptions explicit and open to debate, rather than implicit in the estimation algorithm. A second criticism is that the computations are complex, and simple transparent methods that are relatively easy to explain to lay audiences are preferable. We too favor simplicity, but think a distinction needs to be made between the underlying assumptions of the model, which are not particularly complex and are capable of being transmitted in nontechnical terms, and the algorithms used to simulate posterior distributions based on the model, which are very complex but need not be understood by nonstatistical stakeholders. By analogy, least squares estimation for multiple linear regression is beyond the understanding of the large majority of the public who did not get beyond a single elementary statistics course, but that does not prevent its widespread use in real-world policy settings that involve analysis of data. More generally, complex microsimulation and statistical models with subjective assumptions underlie the interpretations of much data that are used to inform public policy in the economic and health arenas.
Table 1. Data and Parameter Layout for Poststratum k and Sex S

Observed data:
              CMS In               CMS Out    Census total
Census In     y[sup S, sub k11]    --         z[sup S, sub k1.]
Census Out    y[sup S, sub k01]    --         --
Total         y[sup S, sub k.1]    --         --

Underlying parameters:
              CMS In                CMS Out               Total
Census In     Psi[sup S, sub k11]   Psi[sup S, sub k10]   Psi[sup S, sub k1.]
Census Out    Psi[sup S, sub k01]   Psi[sup S, sub k00]   Psi[sup S, sub k0.]
Total         Psi[sup S, sub k.1]   Psi[sup S, sub k.0]   Psi[sup S, sub k..]

NOTE: y[sup S, sub k11] and y[sup S, sub k01] are estimated counts of individuals in and out of the Census on the basis of the CMS follow-up; z[sup S, sub k1.] is the Census count, minus imputations and an estimate of erroneous enumerations; Psi[sup S, sub kij] is the population that would reside in the ijth cell if the CMS had been a complete census (stratum = k; sex = S).
Table 2. Observed CMS Counts y and Adjusted Census Counts z (standard errors in parentheses)

Males
PS    y[sup M, sub k11]    y[sup M, sub k01]    z[sup M, sub k1.]
 1    141,567 (16,461)     17,672 (3,100)       205,692 (4,262)
 2    379,139 (37,049)     40,835 (6,257)       460,849 (4,856)
 3    210,725 (25,075)     23,433 (3,579)       260,288 (2,794)
 4    178,356 (38,354)     27,737 (7,639)       124,959 (3,907)
 5    314,080 (49,222)     33,038 (6,777)       315,599 (5,602)
 6    246,492 (51,920)     38,469 (12,365)      278,798 (5,100)
 7    216,082 (31,396)     86,040 (15,883)      300,904 (8,555)
 8    279,818 (36,308)     84,300 (12,284)      429,687 (7,476)
 9    136,885 (16,663)     48,690 (6,988)       215,578 (5,497)
10    111,513 (22,167)     44,969 (12,379)      169,313 (6,255)
11    313,894 (56,391)     67,312 (9,911)       297,656 (5,216)
12     52,785 (11,822)     11,778 (3,017)        80,882 (4,856)

Females
PS    y[sup F, sub k11]    y[sup F, sub k01]    z[sup F, sub k1.]
 1    173,719 (20,288)     24,451 (4,556)       248,862 (4,086)
 2    502,134 (53,051)     35,945 (5,698)       547,036 (4,450)
 3    271,782 (32,200)     22,838 (3,855)       308,192 (3,278)
 4    223,704 (57,827)     42,301 (17,562)      134,557 (3,032)
 5    371,117 (58,356)     25,382 (7,092)       361,009 (3,664)
 6    274,158 (52,666)     39,471 (12,800)      318,979 (6,390)
 7    328,988 (54,152)     95,224 (16,530)      450,241 (7,635)
 8    395,789 (46,484)     59,477 (8,247)       575,657 (13,836)
 9    333,752 (47,017)     51,880 (7,168)       345,949 (4,110)
10    177,245 (36,334)     23,922 (5,913)       203,188 (5,301)
11    482,488 (87,229)     50,058 (9,590)       415,852 (4,846)
12     77,267 (21,163)      8,457 (2,772)       105,417 (3,103)
Table 3. MLEs Under the ICR Model (thousands; cells are Census status-CMS status)

Males
PS       In-In   In-Out   Out-In   Out-Out   Total
 1        142      64       18        8       231
 2        379      82       41        9       510
 3        211      50       23        6       289
 4        178     -53       28       -8       144
 5        314       2       33        0       349
 6        246      32       38        5       322
 7        216      85       86       34       421
 8        280     150       84       45       559
 9        137      79       49       28       292
10        111      58       45       23       238
11        314     -16       67       -4       361
12         53      28       12        6        99
Total   2,581     559      524      152     3,817

Females
PS       In-In   In-Out   Out-In   Out-Out   Total
 1        174      75       24       11       284
 2        502      45       36        3       586
 3        272      36       23        3       334
 4        224     -89       42      -17       160
 5        371     -10       25       -1       386
 6        274      45       39        6       365
 7        329     121       95       35       581
 8        396     180       59       27       662
 9        334      12       52        2       400
10        177      26       24        4       231
11        482     -66       50       -7       459
12         77      28        8        3       117
Total   3,612     403      479       69     4,564
Table 4. MLEs Under the FRR Model Using Bell's (1993) Approach (thousands; cells are Census status-CMS status)

Males
PS       In-In   In-Out   Out-In   Out-Out   Total
 1        142      64       18       20       243
 2        379      82       41       32       534
 3        211      50       23       19       303
 4        125       0       19        9       154
 5        314       2       33       16       364
 6        246      32       38       26       343
 7        216      85       86       90       477
 8        280     150       84      106       620
 9        137      79       49       64       328
10        112      58       45       56       270
11        298       0       64       30       392
12         53      28       12       15       107
Total   2,512     628      512      483     4,136

Females
PS       In-In   In-Out   Out-In   Out-Out   Total
 1        174      75       24       11       284
 2        502      45       36        3       586
 3        272      36       23        3       334
 4        135       0       25        0       160
 5        361       0       25        0       386
 6        274      45       39        6       365
 7        329     121       95       35       581
 8        396     180       59       27       662
 9        334      12       52        2       400
10        177      26       24        4       231
11        416       0       43        0       459
12         77      28        8        3       117
Total   3,446     569      455       94     4,564

NOTE: Negative In-Census-Out-CMS cell counts are set to 0 with Bell's marginal adjustment.
Table 5. Posterior Means Under the Bayes FRR Model (thousands; cells are Census status-CMS status)

Males
PS       In-In   In-Out   Out-In   Out-Out   Total
 1        145      60       17       20       242
 2        417      44       40       26       527
 3        236      24       23       15       298
 4        123       3       26       14       165
 5        309       7       32       17       364
 6        267      11       31       17       326
 7        239      60       82       74       456
 8        282     148       84      110       623
 9        139      76       48       64       327
10        128      40       42       43       253
11        292       6       66       34       398
12         62      18       11       11       102
Total   2,638     496      501      445     4,081

Females
PS       In-In   In-Out   Out-In   Out-Out   Total
 1        179      66       26       10       281
 2        519      12       50        1       583
 3        290      10       29        1       331
 4        128       3       30        1       161
 5        343       6       34        1       384
 6        297      13       39        2       350
 7        358      59      120       22       559
 8        380     146       92       37       655
 9        307       6       83        2       398
10        174       7       43        2       226
11        383       6       75        1       465
12         90       8       13        1       112
Total   3,450     343      632       81     4,505
Table 6. Male-to-Female Sex Ratios by Poststratum

PS      Adjusted census   ICR MLE   FRR MLE   Bayes FRR (Sigma = 1)   Bayes FRR (Sigma = .033)
 1          .826            .815      .858           .863                    .890
 2          .842            .871      .911           .904                    .905
 3          .845            .866      .907           .901                    .902
 4          .929            .902      .960          1.042                    .912
 5          .874            .904      .945           .948                    .931
 6          .874            .833      .940           .934                    .911
 7          .668            .725      .822           .817                    .900
 8          .746            .844      .937           .952                    .916
 9          .623            .731      .822           .822                    .890
10          .833           1.030     1.170          1.123                    .918
11          .716            .788      .853           .857                    .893
12          .769            .846      .919           .904                    .906
Total       .782            .836      .906           .906                    .906
s (x 10[sup -2])  9.18      8.24      9.20           8.49                   1.23
Table 7. Total Population Estimates for African-Americans Age 30-49 (thousands)

PS    Adjusted Census   ICR MLE   FRR MLE   Bayes FRR   Bayes FRR 95% PPI
 1         455            515       527        523       (498-550)
 2       1,008          1,097     1,120      1,109       (1,079-1,142)
 3         568            623       637        629       (610-650)
 4         260            304       314        326       (278-384)
 5         677            734       750        748       (716-781)
 6         598            687       708        676       (614-739)
 7         751          1,001     1,058      1,015       (922-1,128)
 8       1,005          1,221     1,282      1,278       (1,191-1,381)
 9         562            692       728        725       (682-774)
10         373            468       500        479       (419-555)
11         714            820       851        863       (821-909)
12         186            216       224        214       (192-242)
GRAPHS: Figure 1. Percent of Total Population Relative to Census Counts Minus Imputations and Erroneous Enumerations (z[sub k1.]) Under the Independence (ICR) Model Using MLEs (---), Under the Fixed Relative Risk (FRR) Model Using MLEs (.....), and Under the FRR Bayes Model Using Posterior Means (- - - -), by Gender and Stratum.
GRAPHS: Figure 2. Posterior Predictive Draws of Adjusted Census Counts (z[sub k1.]): Percent Male (y-Axis) Versus Total (x-Axis). Observed data given by "X."
GRAPHS: Figure 3. Gibbs Sampling Draws of the Total Population Estimates Within Each Poststratum: Percent Male (y-Axis) Versus Total (x-Axis). Adjusted Census counts and MLEs for the ICR and FRR models given by "C," "I," and "F."
Bell, W. (1993), "Using Information From Demographic Analysis in Post-Enumeration Survey Estimation," Journal of the American Statistical Association, 88, 1106-1118.
Bell, W., Gibson, C., Das Gupta, P., Spencer, G., Robinson, G., Mulry, M., Vacca, A., Fay, R., and Leggieri, C. (1996), "Report of the Working Group on the Use of Demographic Analysis in Census 2000," Bureau of the Census, U.S. Department of Commerce.
Casella, G., and Berger, J. (1990), Statistical Inference, Belmont, CA: Duxbury Press.
Choi, C. Y., Steel, D. G., and Skinner, T. J. (1988), "Adjusting the 1986 Australian Census Count for Underenumeration," Survey Methodology, 14, 173-190.
Das Gupta, P., and Robinson, J. G. (1990), "Combining Demographic Analysis and Post-Enumeration Survey to Estimate Census Undercount," Bureau of the Census, U.S. Department of Commerce.
Fay, R. E., Passel, J. S., and Robinson, J. G. (1988), "The Coverage of Population in the 1980 Census," Bureau of the Census, U.S. Department of Commerce.
Gelfand, A. E., and Smith, A. M. F. (1990), "Sampling-Based Approaches to Calculating Marginal Densities," Journal of the American Statistical Association, 85, 389-409.
Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (1995), Bayesian Data Analysis, New York: Chapman and Hall.
Gelman, A., Meng, X-L., and Stern, H. S. (1996), "Posterior Predictive Assessment of Model Fitness via Realized Discrepancies" (with discussion), Statistica Sinica, 6, 733-807.
Gelman, A., and Rubin, D. B. (1992), "Inference From Iterative Simulation Using Multiple Sequences," Statistical Science, 7, 457-472.
Goodnight, J. H. (1979), "A Tutorial on the SWEEP Operator," The American Statistician, 33, 149-158.
Hogan, H. (1992), "The 1990 Post-Enumeration Survey: Operations and New Results," in Proceedings of the Survey Research Methods Section, American Statistical Association, pp. 28-37.
Isaki, C. T., and Schultz, L. K. (1986), "Dual-System Estimation Using Demographic Analysis Data," Journal of Official Statistics, 2, 169-179.
Little, R. J. A., and Rubin, D. B. (1987), Statistical Analysis With Missing Data, New York: Wiley.
Raftery, A. E., Givens, G. H., and Zeh, J. E. (1995), "Inference From a Deterministic Population Dynamics Model for Bowhead Whales," Journal of the American Statistical Association, 90, 402-416.
Robinson, J. G. (1993), "Estimation of Population Coverage in the 1990 United States Census Based on Demographic Analysis," Journal of the American Statistical Association, 88, 1061-1071.
Robinson, J. G. (1995), "Coverage Measurement and Evaluation in the 2000 Census: What is the Role of Demographic Analysis," paper presented at the Census Advisory Committee of Professional Associations meeting.
Seber, G. A. F. (1982), The Estimation of Animal Abundance and Related Parameters, New York: Macmillan.
Sekar, C. C., and Deming, W. E. (1949), "On a Method of Estimating Birth and Death Rates and the Extent of Registration," Journal of the American Statistical Association, 44, 101-115.
Slud, E. (1998), "Predictive Models for Decennial Census Household Response," in Proceedings of the Survey Research Methods Section, American Statistical Association, pp. 272-277.
Smith, A. F. M., and Roberts, G. O. (1993), "Bayesian Computation via the Gibbs Sampler and Related Markov Chain Monte Carlo Methods," Journal of the Royal Statistical Society, Ser. B, 55, 3-23.
U.S. Census Bureau (1991), 1990 Census of Population: United States, U.S. Census CP-1-1, Washington, DC: U.S. Department of Commerce.
Wolter, K. M. (1990), "Capture-Recapture Estimation in the Presence of a Known Sex Ratio," Biometrics, 46, 157-162.
[Received December 1998. Revised December 1999.]
~~~~~~~~
By Michael R. Elliott and Roderick J. A. Little
Michael Elliott is a Ph.D. student and Roderick Little is Professor and
Chair, Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109
(E-mail: mrelliot@umich.edu). This research was supported by Bureau of the
Census contract 50-YABC-7-66020, task 46-YABC-7-0002. The authors would like to
thank Gregg Robinson and Eric Schindler for making the Census data available, and Alan Zaslavsky, Thomas Belin,
the associate editor, and an anonymous reviewer for their review and comments.
Title: | Book Reviews. |
Subject(s): | |
Source: | |
Author(s): | |
Abstract: | Reviews the book `Bayesian Methods: An Analysis for Statisticians and Interdisciplinary Researchers,' by Thomas Leonard and John S. J. Hsu. |
AN: | 3167628 |
ISSN: | 0162-1459 |
Database: | Business Source Premier |
Thomas LEONARD and John S. J. HSU. New York: Cambridge University Press, 1999. ISBN 0-521-59417-0. xi + 333 pp. $64.95.
In this book's Preface, the authors comment that the state of statistical science is continuously evolving, and that it is important for applied researchers to be able to use the new methodology with specific knowledge of the assumptions involved in these methods. They describe the two schools of statistical thought, the "Fisherian" (usually called classical or frequentist) and Bayesian philosophies, and state that Bayesian methods have a number of advantages over the Fisherian procedures, including good long-run frequency properties. This book aims to show how Bayesian statistical methods can be used in drawing scientific, medical, and social conclusions from data. The book is intended for use by masters-level students learning statistics from both Fisherian and Bayesian viewpoints, by interdisciplinary researchers working in statistical modeling in their own area, and by doctoral students interested in research in Bayesian methodology. I was very interested in this book, as I see a need for texts that show the advantages of Bayesian methodology in a variety of applied statistical settings.
Chapter 1 sets the stage for the Bayesian material to follow. The first section defines the likelihood function, the maximum likelihood estimate, and different information criteria [such as the Akaike information criterion (AIC) and Bayesian information criterion (BIC)] that can be used to compare models. Large-sample properties of likelihood-based procedures are described, and a number of examples are worked out carefully to illustrate the generality of a likelihood approach to making inferences for a particular model, or to comparing models. Chapter 1 concludes with statements of important results related to likelihood inference, such as the factorization theorem, the Cramer-Rao and Rao-Blackwell theorems, the likelihood principle, and the use of a profile likelihood to eliminate nuisance parameters. It seems that this chapter would serve as a useful review for those students who have been exposed to likelihood inference, but that the first-year graduate student would find some of the later topics, such as a profile likelihood, hard to follow.
Bayes's rule in the discrete setting is the focus of Chapter 2. The use of Bayes's theorem in medical diagnosis and legal settings is described, and an example illustrates Bayesian inference for a discrete-valued parameter. Section 3 describes choosing between a discrete set of models, and Bayes's rule together with a Schwarz approximation to the integrated likelihood are shown to give simple expressions for the posterior model probabilities. This chapter concludes with an interesting example showing how Bayes's rule can be applied in logistic discrimination.
Chapter 3 describes the implementation of the Bayesian paradigm in the one-parameter case. This chapter covers assessment of a prior distribution, computation of the posterior and predictive distributions, and various ways of summarizing the posterior distribution. There is disagreement among Bayesians on how to construct a test of a point null hypothesis of the form Theta = Theta[sub 0]. The proposal here is to compute a "Bayesian significance probability" P(Theta </= Theta[sub 0]| data) to make a decision. The authors do not advocate placing a positive probability on the value Theta[sub 0], which was discussed by Berger and Sellke (1987). Bayesian inference for many of the standard examples, such as binomial, Poisson, and normal mean with known variance, are described here. Some of the examples, such as inference for an upper bound of a uniform distribution and normal mean inference using a uniform prior on a bounded interval, seem a bit artificial but may be helpful for practice in computing posterior and predictive densities. One section discusses the choice of suitable vague priors for a single parameter, including Jeffreys's invariance prior. The chapter concludes with a compact introduction to decision theory, including admissible and minimax procedures, and a discussion of the value of Bayesian estimates from this perspective. This discussion motivates Chapter 4, on the construction of a utility function, which proposes a method for eliciting one's utility function and introduces Savage's expected utility hypothesis.
Chapter 5 discusses Bayesian methods for multiple parameter problems. A Laplacian approximation is proposed as a general way of computing marginal posterior and predictive densities. These Laplace-type methods are illustrated for several examples, and the approximate marginal posterior densities are compared to densities computed by importance sampling. (It is interesting to note that the conjugate analysis for a normal mean and variance is included only as a self-study exercise.) The examples of this chapter, which reflect the authors' interests, include inference about parameters in a two-way contingency table, fitting a quasi-independence model, and Bayesian forecasting using the Kalman filter.
The concluding Chapter 6 deals with simultaneous estimation of many parameters, the Stein estimation problem. The chapter opens with a general approximation to a posterior when the likelihood is approximately normal and a multivariate normal prior is chosen for a suitably transformed parameter. This construction motivates the consideration of a normal prior, where a particular structure (say, exchangeability) is chosen for the parameters and several prior parameters of the structure are unknown. Four philosophical ways are described for handling the unknown hyperparameters, including hierarchical Bayes, empirical Bayes, and marginal and joint posterior modes. As in Chapter 5, Laplacian-type methods are used as a main computational tool. The examples of this chapter describe the smoothing effect of an exchangeable-type prior distribution in a variety of different settings.
Although a number of books are currently available on Bayesian inference at the advanced graduate level (Bernardo and Smith 1994; Berger 1985; O'Hagan 1994), only a few texts, such as those of Carlin and Louis (1996) and Gelman, Carlin, Stern, and Rubin (1995), as well as this one, have the goal of presenting Bayesian methods for researchers of applied statistics. Thus it seems reasonable to compare this text with the texts of Carlin and Louis and Gelman et al.
In some ways, this book would serve as an excellent introduction to Bayesian methods. Chapter 1 gives the reader a nice background to likelihood-based inference, and Chapters 2 and 3 give a detailed explanation of how to use Bayes's rule for a single discrete or continuous-parameter problem. The many detailed worked-out examples and a large number of homework exercises make it a very suitable text for classroom use. However, in Chapters 5 and 6 (the multiple-parameter chapters), the book seems a bit outdated. I believe that one of the biggest advances over the last 30 years has been the development of hierarchical and empirical Bayes methodology to combine data from similar experiments. Also significant are the recent developments in Bayesian computation based on stochastic simulation. Simulation algorithms such as Gibbs sampling and Metropolis-Hastings have made it possible for Bayesians to analyze a wide range of models involving truncated or missing data, order restrictions, and nonparametric forms. The increasing interest in hierarchical models, together with the ability to fit these models by stochastic simulation, has made these methods more popular among applied statisticians. The books of Gelman et al. and Carlin and Louis reflect these advances in Bayesian methods; much of their content is devoted to hierarchical modeling and Markov chain Monte Carlo (MCMC) algorithms. In contrast, Bayesian Methods has relatively little discussion of hierarchical modeling or simulation methodology. Some illustrations of Bayes-Stein-type estimators are presented in Chapter 6, but there is little discussion of the prior belief of exchangeability and when it may be desirable to combine information from related experiments. Regarding computation, the authors appear to have a preference for Laplace-type approximations, and applications of these methods predominate in Chapters 5 and 6.
MCMC methods are introduced, almost as an afterthought, six pages from the end of the text. I believe that Laplace-type methods can be valuable in Bayesian inference, especially for problems with a relatively small number of parameters. However, MCMC methods have great potential to fit models and compute normalizing constants for models with a large number of parameters, and I am puzzled why this book essentially ignores these tools.
Despite these criticisms, I believe that Bayesian Methods complements the other graduate-level Bayesian methods books. The authors are experienced researchers in Bayesian methodology, and much of their research is reflected in the text. The book provides a solid, well-written introduction to the basic tenets of Bayesian modeling, and deserves serious consideration for adoption for a graduate-level introduction to Bayesian methods.
Bernardo, J., and Smith, A. F. M. (1994), Bayesian Theory, New York: Wiley.
Berger, J. O. (1985), Statistical Decision Theory and Bayesian Analysis, New York: Springer-Verlag.
Berger, J. O., and Sellke, T. (1987), "Testing a point null hypothesis: the irreconcilability of P values and evidence" (with discussion), Journal of the American Statistical Association, 82, 112-139.
Carlin, B. P., and Louis, T. A. (1996), Bayes and Empirical Bayes Methods for Data Analysis, London: Chapman and Hall.
Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (1995), Bayesian Data Analysis, London: Chapman and Hall.
O'Hagan, A. (1994), Kendall's Advanced Theory of Statistics Vol. 2b: Bayesian Inference, London: Edward Arnold.
~~~~~~~~
By James H. Albert, Bowling Green State University
Title: | Book Reviews. |
Subject(s): | |
Source: | |
Author(s): | |
Abstract: | Reviews the book `Markov Chain Monte Carlo,' by Dani Gamerman. |
AN: | 3050340 |
ISSN: | 0040-1706 |
Database: | Business Source Premier |
Markov Chain Monte Carlo, by Dani GAMERMAN, London: Chapman & Hall, 1997, ISBN 0-412-81820-5, xiii + 245 pp., $44.95.
This is a useful textbook for learning the Markov chain Monte Carlo (MCMC) method, the stochastic simulation approach to Bayesian inference. The book is self-contained. In addition to the main chapters (Chaps. 5 and 6) on Gibbs sampling and Metropolis-Hastings algorithms, the author has added four preliminary chapters on stochastic simulation, Bayesian inference, approximation methods, and Markov chains. The book serves well as a one-year textbook for advanced undergraduate and graduate students who lack the needed background in Markov chains and Bayesian inference. It is also useful as a self-study text or reference for readers without these preliminaries.
Although most of the material can be found in research papers and other textbooks, the author does do a good job in collecting and presenting it in a coherent fashion. Chapter 1 discusses the basic techniques for generating discrete and continuous random variables from a distribution. In addition to the inverse transform method, the author also discusses rejection, adaptive rejection, and the weighted resampling method. The book contains a useful section on how to generate a random vector from a multivariate normal or multivariate Student-t distribution and a random matrix from the Wishart distribution. Chapter 2 discusses the basic prior-to-posterior updating in Bayesian inference. It also provides useful examples of normal regression models and hierarchical and dynamic models that include the Kalman filter model. Chapter 3 mainly discusses the classical approaches for approximating the posterior distribution or a functional from the posterior distribution--normal approximations, Laplace approximations, Gaussian quadrature, and Monte Carlo integration. Chapter 4 discusses Markov chains as usually taught in an undergraduate stochastic process course. Including Chapter 4 makes this textbook self-contained. It also helps the reader to understand better the Markov-chain part of the simulation method.
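As a quick, hedged illustration of the inverse transform method that Chapter 1 discusses (the function name and the exponential example are mine, not the book's):

```python
import math
import random

random.seed(7)

def exponential_inverse_transform(rate, n):
    """Draw n Exponential(rate) variates by the inverse transform:
    if U ~ Uniform(0, 1), then -log(1 - U) / rate has the target CDF."""
    return [-math.log(1.0 - random.random()) / rate for _ in range(n)]

draws = exponential_inverse_transform(rate=2.0, n=100_000)
print(sum(draws) / len(draws))  # sample mean; should be near 1 / rate = 0.5
```

The same recipe works for any distribution whose inverse CDF has a closed form; the rejection and weighted resampling methods the chapter covers handle the cases where it does not.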
As the author points out, the main course of the book is Chapters 5 and 6, on Gibbs sampling and Metropolis-Hastings sampling. Gibbs sampling is an MCMC scheme in which the transition kernel is formed by the full conditional distributions. The author discusses the implementation of the Gibbs sampler and the convergence diagnostics. The chapter includes useful reparameterization and blocking ideas to speed up the convergence. The Metropolis and the Metropolis-Hastings algorithms are discussed clearly in Chapter 6. The chapter contains some useful examples in applications: generalized linear mixed models, longitudinal data analysis with random effects, and dynamic generalized linear models.
Chapter 7 includes issues in model adequacy and model selection. It contains a section on Markov chains with jumps that includes the diffusion-jumping algorithm to handle the moves between different submodels.
Given the recent exponential growth of research papers on MCMC, it would be difficult to list all such works in the bibliography. Nevertheless, the book does contain a substantial amount of literature with at least 100 paper citations. The book does have quite a few annoying typographical errors; a spell checker can easily fix most of them in the second edition. It would also be helpful to correct the few errors in the equations.
I have my favorite MCMC textbooks: I like the elegance and the compactness of Tanner (1996) and the liveliness and easy reading of Gelman, Carlin, Stern, and Rubin (1995). I would consider this book a close second to those books. It does contain many details that are not in those two books. This book definitely makes the list of recommended textbooks when I teach Bayesian computation. It should be useful to many readers as a valuable MCMC reference book in addition to Gilks, Richardson, and Spiegelhalter (1996).
Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (1995), Bayesian Data Analysis, London: Chapman & Hall.
Gilks, W. R., Richardson, S., and Spiegelhalter, D. J. (1996), Markov Chain Monte Carlo in Practice, Boca Raton, FL: Chapman & Hall/CRC.
Tanner, M. A. (1996), Tools for Statistical Inference (3rd ed.), New York: Springer-Verlag.
~~~~~~~~
By Lynn Kuo, University of Connecticut
Title: | Comment. |
Subject(s): | |
Source: | |
Author(s): | |
Abstract: | Comments on the article `Nonparametric Analysis of Randomized Experiments With Missing Covariate and Outcome Data,' by Joel L. Horowitz and Charles F. Manski. Approaches for analyzing missing or incomplete data; Comments on the approach for assessing subgroup differences in the prevalence rates. |
AN: | 3031449 |
ISSN: | 0162-1459 |
Database: | Business Source Premier |
The problem of missing data is one faced by many analysts. Several approaches for analyzing incomplete data have been, and continue to be, developed from various perspectives. All of these approaches, however, make some assumptions about the missing-data mechanism. The most prevalent assumption is that of missing at random (MAR) (Rubin 1976); that is, adjusting for the set of all observed variables, the unobserved residuals are random. This is a conditional assumption that has been shown to be reasonable in many practical situations when the relevant variables are included in the conditioning (David, Little, Samuhal, and Triest 1986; Rubin 1987; Rubin, Stern, and Vehovar 1995). The limitation of this assumption is due mainly to unavailability of relevant conditioning variables. This often can be addressed by collecting auxiliary data, including contextual data, relevant to the variables with missing values. If one is not willing to make this MAR assumption, then an explicit modeling of the missing-data mechanism is required through postulating either a selection mechanism (Heckman 1976) or a pattern mixture model (Little 1993, 1994; Rubin 1977). Both these models are empirically unverifiable given only the observed data. These are tough choices an analyst has to make when drawing inferences from incomplete data.
Horowitz and Manski (hereinafter HM) propose an alternative along the lines of a method suggested by Cochran (1977, p. 361): in the presence of missing data, to avoid making any assumption about the missing-data mechanism, we should abandon the traditional notion of a point estimate and its confidence interval as the means of drawing inferences, acknowledge that only "bounds" for the point estimates can be constructed, and use the confidence intervals for these bounds for inferential purposes. Incidentally, Cochran (1977, p. 362) stated that "the limits are distressingly wide unless the nonresponse rate is very small." Cochran also investigated the sample sizes that would be needed to give the same widths of confidence interval if the nonresponse rate were zero. He stated that "if we are content with a crude estimate (absolute error = 20%), amounts of nonresponse up to 10% can be handled by doubling the sample size. However, any sizeable percentage of nonresponse makes it impossible or very costly to attain a highly guaranteed precision by increasing the sample size among the respondents" (p. 363).
HM focus on a binary outcome variable as did Cochran (1977). In some parts of the article they seem to make several strong assumptions. For example, the covariates X are assumed to be missing completely at random (MCAR), which not only seems to be rather implausible in many observational and randomized experiments but can be tested from observed data. Also, it is not clear what one would do for a continuous variable such as income or for a continuous covariate.
In fact, the HM framework is an elegant presentation of a crude sensitivity analysis where the bounds are created by two unrealistic "what-if" scenarios. To illustrate the usefulness of this approach, consider the data from the National Comorbidity Survey (NCS), a national probability sample of about 10,000 individuals (Kessler et al. 1994). The primary aim of this survey is to estimate the prevalence rates for 14 types of psychiatric disorders and support epidemiological investigations of associations between several risk factors and these disorders. The prevalence of social phobia, one of the 14 disorders, is of particular interest. The response rate for this item in the survey was 85.9%, and among the respondents, the prevalence rate of social phobia was 13.08%. The bounds are constructed by computing the prevalence rates under two scenarios: all nonrespondents have social phobia and all nonrespondents do not have social phobia. Let r denote the response rate and p[sub R] the prevalence of social phobia among the respondents; the bounds are then (L[sub n] = rp[sub R], U[sub n] = rp[sub R] + 1 - r). In the HM framework, all we can say is that the actual prevalence estimate is contained in the interval (11.3%, 25.4%).
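This calculation can be sketched in a few lines (the function name is mine; tiny rounding differences from the quoted interval are expected because the published percentages are themselves rounded):

```python
def hm_bounds(r, p_r):
    """Worst-case bounds on the overall prevalence rate.

    r   : response rate
    p_r : prevalence among respondents
    The lower bound assumes no nonrespondent is a case;
    the upper bound assumes every nonrespondent is a case.
    """
    lower = r * p_r
    upper = r * p_r + (1.0 - r)
    return lower, upper

# NCS social-phobia figures quoted in the text
low, high = hm_bounds(0.859, 0.1308)
print(f"bounds: ({100 * low:.1f}%, {100 * high:.1f}%)")
```

Note that the width of the interval, 1 - r, depends only on the response rate, not on the sample size.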
For inference, HM suggest using a wider confidence bound, (L[sub n] - z[sub n Alpha], U[sub n] + z[sub n Alpha]), which contains the population bounds (L = Pi Theta[sub R], U = Pi Theta[sub R] + 1 - Pi) with probability 1 - Alpha, where E(r) = Pi and E(p[sub R]) = Theta[sub R]. This is a random interval that envelops, with probability 1 - Alpha, the region of the parameter space (L, U) that contains our inferential parameter of interest Theta = Pi Theta[sub R] + (1 - Pi)Theta[sub NR], where Theta[sub NR] is the prevalence rate among the nonrespondents. For this example, using the bootstrap procedure described by HM, I obtained the 95% confidence interval to be (10.7%, 26.1%), which is considerably wider than the 95% confidence interval under MCAR (12.4%, 13.8%). Even under the MAR assumption, the 95% confidence interval using weighting to correct for missing data was (12.2%, 13.9%).
HM's approach for assessing subgroup differences in the prevalence rates is even more problematic. For example, the prevalence of social phobia among male and female respondents was 11.4% and 14.6%, respectively--a large, statistically significant difference under the MCAR and MAR assumptions. The response rates for males and females differed slightly (84.2% for males and 87.4% for females). Following HM's approach, all that can be said is that the difference in the actual estimated prevalence rates is somewhere between -15.7% and +12.63%. The 95% confidence bound for this interval is even wider; that is, any difference smaller than 15.7 percentage points would be deemed not statistically significant at any level of significance under their scheme. Further classification, by race for example, produces even wider intervals. Even under MCAR, any difference smaller than (1 - r)/r, where r is the overall response rate, would be considered not statistically significant at any level of significance, regardless of the sample size. In fact, under this approach, many well-established risk factors for these psychiatric disorders would be deemed not statistically significant. This seems a very costly approach for "encompassing all nonrefutable assumptions about the nature of missing data," as HM claim!
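A sketch of the subgroup calculation, pairing the extremes of the two gender-specific bounds (names are mine; small rounding differences from the quoted -15.7% and +12.63% are expected):

```python
def hm_bounds(r, p_r):
    # worst-case bounds on prevalence: response rate r,
    # prevalence among respondents p_r
    return r * p_r, r * p_r + (1.0 - r)

# gender-specific NCS figures quoted in the text
lo_m, hi_m = hm_bounds(0.842, 0.114)   # males
lo_f, hi_f = hm_bounds(0.874, 0.146)   # females

# the male-minus-female difference is bounded by pairing extremes
diff_lower = lo_m - hi_f
diff_upper = hi_m - lo_f
print(f"difference bounds: ({100 * diff_lower:.1f}%, {100 * diff_upper:.1f}%)")
```

Because the extremes are paired, the width of the difference interval is the sum of the two nonresponse rates, which is why subgroup comparisons degrade so quickly under this scheme.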
If the sensitivity analysis (which always should be a component in the analysis of incomplete data) is the goal, then it can be handled rather easily within the Bayesian framework. Gelman, Carlin, Stern, and Rubin (1995) and Rubin (1977) have given several general examples, and Raghunathan (1994) and Raghunathan and Siscovick (1996) have provided examples using binary outcome variables. For the NCS example, let n[sub R], n[sub NR], y[sub R], and y[sub NR] denote the sample sizes and binomial responses for respondents and nonrespondents (with y[sub NR] unobserved). A pattern mixture model may be used for the outcome variables:
y[sub R] is distributed as Bin(n[sub R], Theta[sub R]),
y[sub NR] is distributed as Bin(n[sub NR], Theta[sub NR]),
and
n[sub R] is distributed as Bin(n[sub R] + n[sub NR], Pi).
For the unknown parameters, Theta[sub R], Theta[sub NR], and Pi, the following prior specifications complete the model:
Pr(Pi) proportional to Pi[sup -1](1 - Pi)[sup -1],
Pr(Theta[sub R]) proportional to Theta[sup -1, sub R](1 - Theta[sub R])[sup -1],
and
Theta[sub NR]|Theta[sub R] is distributed as Beta(Alpha, Beta),
where Alpha and Beta are chosen so that prior conditional mean of Theta[sub NR] is Mu = Alpha/(Alpha + Beta) = (1 + d)Theta[sub R] and variance Sigma[sup 2] = Mu(1 - Mu)/(Alpha + Beta + 2) = c[sup 2]Theta[sup 2, sub R]. Here d and c govern the extent to which respondent and nonrespondents differ and can be used to index the sensitivity analysis, with d representing the systematic bias and c/(1 + d) the coefficient of variation measuring the a priori uncertainty in the actual prevalence rate among the nonrespondents. The observed data, however, cannot provide any information about c and d. When c = 0 and d = 0, the data are MAR.
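To make this parameterization concrete, Alpha and Beta can be recovered from Theta[sub R], d, and c by inverting the two moment equations above (a sketch with a function name of my choosing; it assumes c > 0, since c = 0 collapses the prior to a point mass):

```python
def beta_hyperparameters(theta_r, d, c):
    """Back out (Alpha, Beta) from the parameterization in the text:
    mean     mu = Alpha / (Alpha + Beta) = (1 + d) * theta_r
    variance mu * (1 - mu) / (Alpha + Beta + 2) = (c * theta_r) ** 2
    Requires c > 0 (c = 0 makes the prior degenerate at mu)."""
    mu = (1.0 + d) * theta_r
    var = (c * theta_r) ** 2
    s = mu * (1.0 - mu) / var - 2.0      # s = Alpha + Beta
    return mu * s, (1.0 - mu) * s

# e.g. respondent prevalence 13%, 50% systematic bias, c = .5
alpha, beta = beta_hyperparameters(0.13, d=0.5, c=0.5)
```

Larger c spreads the prior on Theta[sub NR] out (smaller Alpha + Beta), and larger d shifts its mean above Theta[sub R].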
The inferential quantity of interest is Theta = Pi Theta[sub R] + (1 -Pi)Theta[sub NR]. The relevant joint posterior distribution of unknown quantities of interest is
Pr(Pi, Theta[sub R], Theta[sub NR], y[sub NR]|y[sub R], n[sub R], n[sub NR], d, c)
proportional to Pi[sup n[sub R]-1](1 - Pi)[sup n[sub NR]-1] Theta[sup y[sub R]-1, sub R](1 - Theta[sub R])[sup n[sub R]-y[sub R]-1]
x Theta[sup y[sub NR]+Alpha-1, sub NR](1 - Theta[sub NR])[sup n[sub NR]-y[sub NR]+Beta-1].
Markov chain Monte Carlo methods such as Gibbs sampling can easily be used to generate values from the foregoing posterior distribution, which then leads to the posterior distribution of Theta. The only complicated step in this analysis is obtaining draws from the conditional posterior distribution of Theta[sub R] given all other quantities, as Alpha and Beta are also functions of Theta[sub R]. This can be handled easily using a rejection-type algorithm. I have tabulated 95% posterior intervals for various choices of c and d in Table 1. Even with 50% bias and a coefficient of variation of 33%, the 95% posterior interval is (12.3, 15.9), relatively close to the MAR interval. In fact, HM's confidence bounds correspond to the 95% posterior interval when the systematic bias is 315% and the coefficient of variation is 68% (the last row in Table 1). The beta distribution for Theta[sub NR] is merely an example; other distributions can be used in its place. Nevertheless, the beta distribution is mathematically convenient and relatively rich in distributional characteristics. The pattern mixture model developed here can also be extended to several levels of covariates.
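Such a sampler can be sketched as follows. This is an illustrative reconstruction, not the author's code: the sample sizes are scaled-down stand-ins for the NCS figures, the function names are mine, and a random-walk Metropolis step substitutes for the rejection-type algorithm mentioned above.

```python
import math
import random

random.seed(1)

def hyper(theta_r, d, c):
    # Beta hyperparameters (Alpha, Beta) as parameterized in the text
    mu = (1.0 + d) * theta_r
    var = (c * theta_r) ** 2
    s = mu * (1.0 - mu) / var - 2.0
    return mu * s, (1.0 - mu) * s

def log_beta_pdf(x, a, b):
    return ((a - 1.0) * math.log(x) + (b - 1.0) * math.log(1.0 - x)
            - (math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)))

def gibbs(y_r, n_r, n_nr, d, c, iters=3000, burn=1000):
    th_r = th_nr = y_r / n_r
    draws = []
    for t in range(iters):
        # Pi | rest ~ Beta(n_R, n_NR) under the Haldane-type prior
        pi = random.betavariate(n_r, n_nr)
        # y_NR | Theta_NR ~ Binomial(n_NR, Theta_NR)
        y_nr = sum(random.random() < th_nr for _ in range(n_nr))
        # Theta_NR | rest ~ Beta(y_NR + Alpha, n_NR - y_NR + Beta)
        a, b = hyper(th_r, d, c)
        th_nr = random.betavariate(y_nr + a, n_nr - y_nr + b)
        # Theta_R | rest is nonstandard because Alpha and Beta depend
        # on it; a random-walk Metropolis step stands in for the
        # rejection algorithm mentioned in the text
        def logpost(th):
            aa, bb = hyper(th, d, c)
            return ((y_r - 1.0) * math.log(th)
                    + (n_r - y_r - 1.0) * math.log(1.0 - th)
                    + log_beta_pdf(th_nr, aa, bb))
        prop = th_r + random.gauss(0.0, 0.01)
        if 0.0 < prop < 1.0 / (1.0 + d):   # keep the prior mean below 1
            if math.log(random.random()) < logpost(prop) - logpost(th_r):
                th_r = prop
        if t >= burn:
            draws.append(pi * th_r + (1.0 - pi) * th_nr)
    return draws

# scaled-down sizes (roughly one-tenth of the NCS example) for speed
draws = sorted(gibbs(y_r=112, n_r=859, n_nr=141, d=0.5, c=0.5))
lo, hi = draws[int(0.025 * len(draws))], draws[int(0.975 * len(draws))]
print("95%% posterior interval: (%.3f, %.3f)" % (lo, hi))
```

With the smaller sample sizes the posterior interval is wider than the tabulated (12.3, 15.9), but the mechanics of the sampler are the same.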
Even for a large survey with a relatively high response rate like the NCS, HM's bounds are unreasonably wide. The bounds are determined purely by the response rate and the prevalence among the respondents. That is, a survey with 100 subjects would give the same bounds for a subgroup difference in the proportions as would a survey with 10,000 subjects if the response rate and the prevalence among the respondents in the two surveys were the same. Of course, the confidence interval for the bounds would be larger in the smaller survey. I cannot imagine applying this approach to some other surveys with 70-75% response rates (typical in many national surveys these days). I am afraid that I agree with Cochran (1977) that such an approach is so conservative as to be of little value in most practical settings for inferential purposes. Such bounds, however, can be useful in exploratory analysis to assess the limits. The sensitivity analysis based on the pattern mixture model described earlier may be more useful for narrowing the bounds (see also Cochran, Mosteller, and Tukey 1954 for other alternatives). Nevertheless, HM's article is useful in illuminating the inherent difficulties of drawing inferences when some data are missing, and it is a reminder that careful thought is necessary when attempting to draw inference from incomplete data.
Table 1. 95% Posterior Intervals for Various Choices of c and d

c, d                  Lower limit   Upper limit
c = 0,    d = 0          12.4          13.8
c = 0,    d = .5         13.2          14.8
c = 0,    d = .75        13.7          14.3
c = .5,   d = 0          11.4          15.0
c = .5,   d = .5         12.3          15.9
c = .5,   d = .75        12.7          16.4
c = .75,  d = 0          10.6          16.0
c = .75,  d = .5         11.5          16.9
c = .75,  d = .75        12.0          17.4
c = 2.15, d = 2.15       10.7          26.1
Cochran, W. G. (1977), Sampling Techniques (3rd ed.), New York: Wiley.
Cochran, W. G., Mosteller, F., and Tukey, J. W. (1954), Statistical Problems of the Kinsey Report, Washington, DC: American Statistical Association.
David, M., Little, R. J. A., Samuhal, M. E., and Triest, R. K. (1986), "Alternative Methods for CPS Income Imputation," Journal of the American Statistical Association, 81, 29-41.
Gelman, A., Carlin, J., Stern, H., and Rubin, D. B. (1995), Bayesian Data Analysis. London: Chapman and Hall.
Heckman, J. (1976), "The Common Structure of Statistical Models of Truncation, Sample Selection and Limited Dependent Variables and a Simple Estimator for Such Models," Annals of Economic and Social Measurement, 5, 475-492.
Kessler, R. C., McGonagle, K. A., Zhao, S., Nelson, C. B., Hughes, M., Eshleman, S., Wittchen, U. H., and Kendler, K. (1994), "Lifetime and 12-Month Prevalence of DSM-III-R Psychiatric Disorders in the United States: Results From the National Comorbidity Survey," Archives of General Psychiatry, 51, 8-19.
Little, R. J. A. (1993), "Pattern-Mixture Models for Multivariate Incomplete Data," Journal of the American Statistical Association, 88, 125-134.
----- (1994), "A Class of Pattern-Mixture Models for Normal Missing Data," Biometrika, 81, 3, 471-483.
Raghunathan, T. E. (1994), "Monte Carlo Methods for Exploring Sensitivity to Distributional Assumptions in a Bayesian Analysis of a Series of 2 x 2 Tables," Statistics in Medicine, 13, 1525-1538.
Raghunathan, T. E., and Siscovick, D. S. (1996), "A Multiple Imputation Analysis of a Case-Control Study of the Risk of Primary Cardiac Arrest Among Pharmacologically Treated Hypertensives," Applied Statistics, 45, 335-352.
Rubin, D. B. (1976), "Inference and Missing Data" (with discussion), Biometrika, 63, 581-592.
----- (1977), "Formalizing Subjective Notions About the Effect of Nonrespondents in Sample Surveys," Journal of the American Statistical Association, 72, 538-543.
Rubin, D. B., Stern, H., and Vehovar, V. (1995), "Handling 'Don't Know' Survey Responses: The Case of the Slovenian Plebiscite," Journal of the American Statistical Association, 90, 822-828.
By T. E. Raghunathan
T. E. Raghunathan is Senior Associate Research Scientist, Institute for Social Research, and Associate Professor of Biostatistics, University of Michigan, Ann Arbor, MI 48106.
Title: | Reference Bayesian Methods for Generalized Linear Mixed Models. |
Subject(s): | |
Source: | |
Author(s): | |
Abstract: | Bayesian methods furnish an attractive approach to inference in generalized linear mixed models. In the absence of subjective prior information for the random-effect variance components, these analyses are typically conducted using either the standard invariant prior for normal responses or diffuse conjugate priors. Previous work has pointed out serious difficulties with both strategies, and we show here that as in normal mixed models, the standard invariant prior leads to an improper posterior distribution for generalized linear mixed models. This article proposes and investigates two alternative reference (i.e., "objective" or "noninformative") priors: an approximate uniform shrinkage prior and an approximate Jeffreys's prior. We give conditions for the existence of the posterior distribution under any prior for the variance components in conjunction with a uniform prior for the fixed effects. The approximate uniform shrinkage prior is shown to satisfy these conditions for several families of distributions, in some cases under mild constraints on the data. Simulation studies conducted using a logit-normal model reveal that the approximate uniform shrinkage prior improves substantially on a plug-in empirical Bayes rule and fully Bayesian methods using diffuse conjugate specifications. The methodology is illustrated on a seizure dataset. [ABSTRACT FROM AUTHOR] |
AN: | 3031973 |
ISSN: | 0162-1459 |
Database: | Business Source Premier |
KEY WORDS: Conjugate prior; Hierarchical models; Jeffreys's prior; Reference prior; Uniform shrinkage prior; Variance components.
A major contribution of Bayesian statistics to data analysis has been methodology associated with two-stage hierarchical models. In the special case of linear mixed models, parametric empirical Bayes methods based on restricted maximum likelihood (REML) estimation are widely used (e.g., PROC MIXED in SAS; lme in S-PLUS). Although successful in linear models, this strategy becomes considerably more difficult in generalized linear mixed models (GLMMs). In these families, fully Bayesian alternatives are attractive because (a) they automatically marginalize over fixed effects, thus producing REML-like solutions; (b) they provide an accounting for the uncertainty in estimating the variance components; and (c) they may be implemented via Markov chain Monte Carlo methods, often through the use of readily available software (e.g., BUGS; Spiegelhalter, Thomas, Best, and Gilks 1996). A requirement of the fully Bayesian approach, however, is the specification of prior distributions for the second-stage variance components. Estimation of these parameters may or may not be of direct interest, but their prior specification is important because of its impact on inferences about the regression coefficients and random effects.
In this article we propose and investigate two reference or "default" priors for the variance components. Desirable properties of such priors are that they lead to a proper posterior and display good frequentist behavior in repeated sampling. (See Kass and Wasserman 1996 for a review of the extensive literature on the general topic of reference priors.) Two choices currently in use are a prior analogous to the conventional 1/Sigma prior in the one-sample normal (Mu, Sigma[sup 2]) problem and diffuse conjugate priors with vague hyperparameter values. However, as is well known for normal mixed models, and as we show here quite generally, the first prior leads to an improper posterior. The diffuse conjugate priors, on the other hand, may lead to badly behaved posterior distributions, as discussed in Section 2.
The priors that we propose are presented in Section 3. In Sections 4 and 5 we discuss propriety issues and the Gibbs sampling implementation. In Section 6 we present a simulation study in which we compare the performance of our priors to that of a plug-in empirical Bayes approach and also to those of fully Bayesian methods using diffuse conjugate priors. In Section 7 we analyze the seizure dataset of Leppik et al. (1987), and in Section 8 we summarize.
The data consist of n observations grouped into I independent clusters of n[sub i] experimental units each. For the jth unit in the ith cluster, let y[sub ij] denote the response and let x[sub ij](p x 1) be a vector of explanatory variables. We consider the following class of two-stage models. In the first stage, conditionally on an unobserved cluster-specific random effect b[sub i](q x 1), the y[sub ij] arise independently from an exponential family with mean Mu[sup b, sub ij] and variance v[sup b, sub ij] = Phi v(Mu[sup b, sub ij]), where the dispersion parameter Phi is assumed known. The conditional mean is related to the linear predictor Eta[sup b, sub ij] = x[sup t, sub ij] Beta + z[sup t, sub ij] b[sub i] by the usual generalized linear model (GLM),
(1) h(Mu[sup b, sub ij]) = Eta[sup b, sub ij],
where h(.) is a monotone differentiable link function with inverse g(.), Beta(p x 1) is a vector of regression coefficients, and z[sub ij](q x 1) is a subset of the x[sub ij] vector. The model specification is completed by a multivariate normal (MVN) distribution b[sub i] is similar to N(0, D) at the second stage, where D is an unstructured variance matrix with elements denoted by Theta[sub ij]. Such models are called random-effect GLMs, or generalized linear mixed models (GLMMs). They provide a flexible framework for capturing within-cluster correlation by allowing heterogeneity in the effect of a subset of covariates (namely z[sub ij]) across the clusters.
By way of notation, we let X[sub i](n[sub i] x p) and Z[sub i] (n[sub i] x q) denote the design matrices for cluster i, and let X(n x p) be a full rank matrix formed by the vertical concatenation (stacking) of {X[sub i], i = 1,..., I}.
A Bayesian formulation of the aforementioned model requires the specification of a prior distribution for Beta and D. A convenient strategy, in the absence of subjective prior information, is to choose a prior on Beta that is uniform and independent of D, and to take Pi[sub n](D) proportional to det(D)[sup -(q+1)/2] (Tiao and Tan 1965; Wang, Rutledge, and Gianola 1994; Zeger and Karim 1991). The prior Pi[sub n] is a standard invariant prior for normal responses and is obtained by applying Jeffreys's rule to the second-stage random-effect distribution. It has the advantage of simplifying the full-conditional calculations required by the Gibbs sampler, but unfortunately leads to an improper joint posterior distribution for Beta and D. The following theorem (proven in the Appendix) states this result and is a generalization of theorems of Natarajan and McCulloch (1995) and Hobert and Casella (1996).
Theorem 1. For the GLMM in (1) with Beta and D assumed to be a priori independent, Pi[sub n](D) in conjunction with any prior (proper or improper) for Beta produces an improper joint posterior distribution.
A popular method of avoiding improper posterior distributions has been to use proper conjugate priors that are diffuse (Carlin and Louis 1996; Gelfand, Hills, Racine-Poon, and Smith 1990; Gilks, Richardson, and Spiegelhalter 1996, p. 411; McCulloch and Rossi 1994; Spiegelhalter et al. 1996). The conjugate prior for D[sup -1] is the Wishart W((Rho R)[sup -1], Rho) (e.g., Carlin and Louis 1996, p. 166), and the usual choice is to take Rho = q and take the scale matrix R to be a prior guess of D (Spiegelhalter et al. 1996). When D is univariate, the Wishart reduces to a gamma, which is often taken with small values for the shape and scale. The main advantage offered by these priors is computational, as they simplify the implementation and allow the use of software such as BUGS.
Although conjugate priors may work effectively in some applications, they generally can be problematic for two reasons. First, in the absence of good prior information about D, it is unclear how to specify the elements of R. Our simulation studies in Section 6 show that even for moderate sample sizes, the results can be sensitive to the specific choice of R. Second, exceedingly small specifications of the shape and scale for gamma priors may reduce the rate of convergence of the Gibbs sampler for the full set of parameters due to the "near" impropriety of the resulting posteriors (Natarajan and McCulloch 1998). These difficulties motivated our search for alternate default priors for D.
In this section we develop two priors for the variance D: an approximate uniform shrinkage prior and an approximate Jeffreys prior. The first prior involves a generalization of the "uniform shrinkage" priors shown by Strawderman (1971) to produce Bayes estimates with good frequentist properties. The second prior is constructed by applying Jeffreys's general rule using an approximation to the information matrix for D.
3.1 An Approximate Uniform Shrinkage Prior
A heuristic justification for the uniform shrinkage prior Pi(Theta) proportional to (Phi + Theta)[sup -2] in the simple normal means model y[sub i]|b[sub i] is similar to N(Mu[sup b, sub i], Phi), b[sub i] is similar to N(0, Theta), is that it corresponds to placing a uniform prior on the shrinkage parameter s = Phi/(Phi + Theta), which is the weight given to the prior mean in the shrinkage estimate of b[sub i] (the posterior mean). This heuristic suggests a natural way to extend the idea to conjugate two-stage models where the shrinkage parameter may be explicitly evaluated; for example, the Poisson-gamma model (Christiansen and Morris 1997) and the normal-normal model (Daniels 1998). However, for nonconjugate GLMMs, closed-form expressions do not exist, making direct extension of uniform shrinkage difficult.
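The heuristic above is easy to check numerically: drawing the shrinkage weight s uniformly on (0, 1) and transforming to Theta should reproduce the density proportional to (Phi + Theta)[sup -2]. A minimal Python sketch (illustrative only; the value of Phi is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
phi = 2.0

# Draw the shrinkage weight s = phi / (phi + theta) uniformly on (0, 1)
# and invert to theta = phi * (1 - s) / s.
s = rng.uniform(size=200_000)
theta = phi * (1.0 - s) / s

# The induced CDF is P(theta <= t) = P(s >= phi/(phi+t)) = t / (phi + t),
# which is exactly the CDF of pi(theta) = phi / (phi + theta)^2.
for t in (0.5, 2.0, 10.0):
    empirical = np.mean(theta <= t)
    analytic = t / (phi + t)
    print(f"t={t}: empirical {empirical:.3f} vs analytic {analytic:.3f}")
```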
Thus we propose a variant of the foregoing approach that is derived from the approximate shrinkage estimate b[sub i] = DZ[sup t, sub i](W[sup -1, sub i] + Z[sub i]DZ[sup t, sub i])[sup -1](y[sup *, sub i] - Eta[sup 0, sub i]), where W[sub i](n[sub i] x n[sub i]) is the diagonal GLM weight matrix with entries {v[sup 0, sub ij][differential Eta[sup 0, sub ij] /differential Mu[sup 0, sub ij]][sup 2]}[sup -1], y[sup *, sub i] is a working dependent variable, and the superscript 0's indicate the substitution of b with 0 in these quantities (Breslow and Clayton 1993). After some matrix manipulations, b[sub i] may be rewritten as
b[sub i] = S[sub i]0 + (I - S[sub i])DZ[sup t, sub i]W[sub i] (y[sup *, sub i] - Eta[sup 0, sub i]),
with S[sub i] = (D[sup -1] + Z[sup t, sub i]W[sub i]Z[sub i])[sup -1](Z[sup t, sub i]W[sub i]Z[sub i]). The matrix S[sub i] controls the relative contribution of the prior mean to the posterior update, and in this sense may be viewed as a multivariate extension of the parameter s. It may alternately be thought of as the weight given to the prior mean in a normal-normal approximation to the GLMM obtained by replacing the first-stage with a maximum likelihood-based normal approximation.
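The equivalence of the two expressions for the approximate shrinkage estimate rests on a standard matrix identity, which can be verified numerically. The dimensions and matrices below are arbitrary illustrations, not taken from the article:

```python
import numpy as np

rng = np.random.default_rng(1)
n_i, q = 6, 2  # illustrative cluster size and random-effect dimension

Z = rng.standard_normal((n_i, q))
W = np.diag(rng.uniform(0.5, 2.0, size=n_i))   # diagonal GLM weight matrix
A = rng.standard_normal((q, q))
D = A @ A.T + 0.1 * np.eye(q)                  # positive definite variance
r = rng.standard_normal(n_i)                   # working residual y* - eta0

# Form 1: b_hat = D Z' (W^-1 + Z D Z')^-1 r
b1 = D @ Z.T @ np.linalg.solve(np.linalg.inv(W) + Z @ D @ Z.T, r)

# Form 2: b_hat = (I - S) D Z' W r with S = (D^-1 + Z'WZ)^-1 (Z'WZ)
ZtWZ = Z.T @ W @ Z
S = np.linalg.solve(np.linalg.inv(D) + ZtWZ, ZtWZ)
b2 = (np.eye(q) - S) @ D @ Z.T @ W @ r

print(np.allclose(b1, b2))  # True: the two forms agree
```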
Our proposal is to induce a prior for D by placing a componentwise uniform distribution on S[sub i]. But because Z[sup t, sub i]W[sub i]Z[sub i] varies with i, we first replace it by its average across the clusters, as was done by Daniels (1998). That is, we place a uniform distribution on
S = (D[sup -1] + Omega)[sup -1] Omega, where Omega = I[sup -1] Sigma[sub i] Z[sup t, sub i]W[sub i]Z[sub i],
which, on transforming to D, leads to the approximate uniform shrinkage prior
(2) Pi[sub us](D) proportional to det(I[sub q] + Omega D)[sup -(q+1)].
Calculation of Pi[sub us] is relatively straightforward, requiring only knowledge of the form of the weight matrix W[sub i] for the GLM under consideration (McCullagh and Nelder 1989, p. 30) and some matrix operations. In addition, it is proper (integrable). The following theorem states this result and is proven in the Appendix.
Theorem 2. The approximate uniform shrinkage prior is proper; that is, Integral Pi[sub us](D) dD < infinity.
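As a concrete sketch of the calculation, assume the construction yields Pi[sub us](D) proportional to det(I[sub q] + Omega D)[sup -(q+1)], with Omega the average of the Z[sup t, sub i]W[sub i]Z[sub i]. This form reduces to the scalar uniform shrinkage prior (Phi + Theta)[sup -2] in the simple normal means case, which the code below checks; the assumed form and the numbers are ours, not the article's:

```python
import numpy as np

def pi_us_unnorm(D, ZtWZ_list):
    """Unnormalized approximate uniform shrinkage prior, assuming the form
    det(I + Omega D)^-(q+1) with Omega the average of the Z_i' W_i Z_i."""
    q = D.shape[0]
    omega = np.mean(ZtWZ_list, axis=0)
    return np.linalg.det(np.eye(q) + omega @ D) ** (-(q + 1))

# Sanity check against the scalar case: n_i = 1, Z_i = 1, W_i = 1/phi gives
# pi(theta) proportional to (phi + theta)^-2; the ratio below is constant.
phi = 2.0
ZtWZ_list = [np.array([[1.0 / phi]]) for _ in range(10)]
for theta in (0.5, 1.0, 4.0):
    ratio = pi_us_unnorm(np.array([[theta]]), ZtWZ_list) / (phi + theta) ** -2
    print(ratio)  # equals phi^2 for every theta
```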
An important ingredient in (2) is the choice of Beta at which W[sub i] is to be evaluated. We propose using the maximum likelihood estimate of Beta obtained by pooling the clusters and fitting the single-stage GLM with Eta[sub ij] = x[sup t, sub ij]Beta. This involves a simple precalculation, which can be accomplished using standard software, such as SAS or S-PLUS. Although we have no formal justification for this choice, we have found the resulting prior to perform well in a variety of situations. We note that our procedure makes Pi[sub us] depend on the data through this estimate of Beta. But this is a mild form of data dependence, because W[sub i] is known to vary slowly, or not at all, as a function of the conditional mean (Breslow and Clayton 1993) and consequently as a function of Beta. Further, Natarajan and Kass (1999) have shown that this substitution carries no more information than does a single cluster.
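The pooled single-stage fit can be done with any GLM routine; as an illustration, here is a bare-bones iteratively reweighted least squares (IRLS) sketch for the logit link (the function name and settings are our own, not the article's):

```python
import numpy as np

def pooled_logit_mle(X, y, n_iter=25):
    """Fit the single-stage logistic GLM eta_ij = x_ij' beta by IRLS,
    pooling all clusters (a simple stand-in for SAS or S-PLUS)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))
        w = mu * (1.0 - mu)              # GLM weights for the logit link
        z = eta + (y - mu) / w           # working dependent variable
        # Weighted least squares step: solve X'WX beta = X'Wz
        beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
    return beta
```

With moderately sized pooled data and no separation, the iteration converges quickly to the MLE, which is then plugged into the W[sub i] of (2).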
Table 1 displays Pi[sub us] for a one-way random intercept model with normal, Bernoulli, or Poisson first-stage specifications. For the simple normal means model (i.e., n[sub i] = 1 for all i), Pi[sub us] reduces to the uniform shrinkage prior.
3.2 Approximate Jeffreys Prior
A basic method of choosing a prior is to apply Jeffreys's general rule, which is to take Pi(D) proportional to det(I(D))[sup 1/2], where I(D) is the expected Fisher information matrix with Beta held fixed. However, application of this method is often hindered by the difficulty of calculating I(D). For example, in the GLMM, the components of I(D) are given by
(3) [Multiple line equation(s) cannot be represented in ASCII text],
where A[sub j] = D[sup -1](differential D/differential Theta[sub j])D[sup -1] and the inner and outer expectations of the first term are with respect to f(b[sub i]|y[sub i], Beta, D) and f(y[sub i]). Except for y[sub i] is similar to normal, (3) is analytically intractable.
Using methods similar to those used by Breslow and Clayton (1993) in a related problem, we obtain the approximation
(4) [Multiple line equation(s) cannot be represented in ASCII text],
where W[sub i] is as defined in Section 3.1 (see the Appendix for a brief justification). Again, evaluation of (4) involves only straightforward matrix operations. We let Pi[sub j] denote the prior obtained by taking the square root of the determinant of the matrix with components given by the right side of (4). Table 1 displays Pi[sub j] for the one-way layout described in Section 3.1. The approximate Jeffreys prior is improper for all three models and consequently is more diffuse than Pi[sub us].
In this section we state two theorems concerning propriety of the joint posterior distribution of (Beta, D) in the GLMM, based on a uniform prior for Beta and any prior for D. Proofs of these theorems are presented in the Appendix. Theorem 3 gives a sufficient condition for propriety under any first-stage distribution, and Theorem 4 gives both necessary and sufficient conditions for the Bernoulli distribution. Corollaries to these theorems show that the prior Pi[sub us] leads to a proper posterior on (Beta, D) for several common families of distributions, in some cases requiring mild conditions on the data. We have been unable to obtain similar results for Pi[sub j], due to the complicated nature of its dependence on D.
Theorem 3. Suppose that the data arise from the GLMM in (1); then the joint posterior distribution resulting from a prior Pi(Beta, D) proportional to Pi(D) is proper if there exist p full-rank vectors x[sup t, sub k] (k = 1,..., p) such that the integral
[Multiple line equation(s) cannot be represented in ASCII text]
is finite, where r[sub k] = x[sup t, sub k] Beta.
Corollary 1. Under the assumptions of Theorem 3, the list of distributional families for which Pi[sub us] leads to a proper posterior on (Beta, D) includes, but is not limited to, the gamma with canonical or log link, the Poisson with canonical link (if the y corresponding to the full-rank x[sup t, sub k] are nonzero), and the normal and inverse Gaussian families with canonical links.
Theorem 3 is inconclusive for the Bernoulli first-stage specification, so this case is addressed separately in the following theorem.
Theorem 4. Suppose that the data arise from a Bernoulli family. For the joint posterior distribution resulting from a prior Pi(Beta, D) proportional to Pi(D) to be proper, the following condition 1 is necessary, whereas conditions 2 and 3 are jointly sufficient:
1. The system of equations X[sup *]Beta </= 0, Beta not equal to 0 is unsolvable, where X[sup *](n x p) is the matrix with rows (1 - 2y[sub k])x[sup t, sub k].
2. The system of equations X[sup *, sub i]Beta </= 0, Beta not equal to 0 is unsolvable for some i, where X[sup *, sub i] is the matrix with rows (1 - 2y[sub ij])x[sup t, sub ij].
3. The integrals Integral[sup 0, sub -infinity] (-v)[sup p] dg(v) and Integral[sup infinity, sub 0] v[sup p] dg(v) are finite. In addition, the integral I = Integral Pi[sub k not equal to i] Pi[sup n[sub k], sub j=1] f(y[sub kj]|Beta, b[sub k] - b[sub i])f(b|D)Pi(D) db dD </= M for some constant M, where i denotes the cluster for which condition 2 is satisfied.
Conditions 1 and 2 of Theorem 4 arise due to the uniform prior on Beta, whereas condition 3 depends on the choice of link function and prior for D. As shown in the following corollary, condition 3 is satisfied for Pi[sub us] under two commonly used link functions.
Corollary 2. Under the assumptions of Theorem 4, and using the logit or probit link functions, condition 1 is necessary and condition 2 is sufficient for Pi[sub us], or any proper prior Pi(D), to lead to a proper posterior distribution on (Beta, D).
Remark 1. Although conditions 1 and 2 appear complicated, they often reduce to simple requirements on the number of successes (y = 1) and failures (y = 0) that must be observed in the dataset. For instance, for the one-way model h(Mu[sup b, sub ij]) = Beta[sub 0] + b[sub i], condition 1 states that there must be at least one success and one failure observed in the dataset, whereas condition 2 states that there must exist a cluster for which a success and a failure are observed. For more complicated models, the mixed linear program of Santner and Duffy (1988) may be used to verify conditions 1 and 2.
Remark 2. A geometric interpretation of condition 1 is that there should be no separating hyperplane between the x[sup t, sub k] that correspond to y = 1 and those that correspond to y = 0. This is equivalent to requiring that the maximum likelihood estimator (MLE) of Beta in the logistic regression model y[sub ij] is similar to iid Bernoulli (logit (x[sup t, sub ij] Beta)) be finite (Albert and Anderson 1984). Hence condition 1 is extremely mild in practice, and the datasets for which it fails to hold could be considered pathological. Condition 2 is also mild as long as n[sub i] is reasonable relative to p.
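As a rough stand-in for the mixed linear program of Santner and Duffy (1988), a small linear program can detect whether X[sup *]Beta </= 0 has a nonzero solution when X has full column rank. This sketch uses SciPy's linprog; the box bounds and the sum-of-rows objective are our own device (any nonzero feasible Beta makes the optimum strictly negative):

```python
import numpy as np
from scipy.optimize import linprog

def condition1_holds(X, y, tol=1e-8):
    """Check that X* beta <= 0 has no nonzero solution, where the rows of
    X* are (1 - 2 y_k) x_k'.  Minimizing the sum of the components of
    X* beta over a box finds a nonzero feasible beta iff the optimum is
    strictly negative (assuming X has full column rank)."""
    Xstar = (1.0 - 2.0 * y)[:, None] * X
    c = Xstar.sum(axis=0)
    res = linprog(c, A_ub=Xstar, b_ub=np.zeros(len(y)),
                  bounds=[(-1.0, 1.0)] * X.shape[1], method="highs")
    return res.fun > -tol

x = np.array([-2.0, -1.0, 1.0, 2.0])
X = np.column_stack([np.ones(4), x])
# Completely separated data: condition 1 fails (the logistic MLE is infinite).
print(condition1_holds(X, np.array([0.0, 0.0, 1.0, 1.0])))  # False
# Overlapping successes and failures: condition 1 holds.
print(condition1_holds(X, np.array([0.0, 1.0, 0.0, 1.0])))  # True
```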
Remark 3. Condition 2 cannot be satisfied for models where X[sub i] is less than full rank for all clusters. We are not aware of any general sufficient condition that can address this situation; hence these models will need to be handled on a case-by-case basis.
The following result offers a sufficient condition for propriety in one such class of models commonly encountered in comparative experiments.
Result 1. Suppose that the data arise from a Bernoulli family with logit or probit link and with X[sub i] = (1 Trt) (x) C[sub i], where C[sub i](n[sub i] x r, r < p) is a full-rank matrix, Trt is a treatment group indicator and (x) is the direct product operator. The joint posterior distribution resulting from a uniform prior for Beta and a proper prior for D is proper if the following condition 2' holds:
2'. The system of equations C[sup *, sub i] u </= 0, u not equal to 0 is unsolvable for some i in each treatment group, where C[sup *, sub i] is the matrix with rows (1 - 2y[sub ik])c[sup t, sub ik].
We consider this class of models in Section 6 and show that condition 2' reduces to mild requirements on the number of successes and failures observed in each experimental group.
The Gibbs sampler is a natural method for posterior simulation in the GLMM (Zeger and Karim 1991). The algorithm proceeds by iteratively sampling Beta, b[sub i] and D from their respective full-conditional distributions until convergence is achieved. The full conditional distributions of Beta and b[sub i] are given by
f(Beta|y, b[sub 1],..., b[sub I]) proportional to Pi[sub i] Pi[sub j] f(y[sub ij]|Beta, b[sub i])
and
f(b[sub i]|y, Beta, D) proportional to f(b[sub i]|D) Pi[sub j] f(y[sub ij]|Beta, b[sub i]),
and are identical to those derived by Zeger and Karim (1991). We use the rejection sampling procedure that they outlined in their sections 5.1 and 5.3 for sampling these quantities. The full-conditional specification of D is proportional to
(5) Pi(D) det(D)[sup -I/2] exp(-(1/2) Sigma[sub i] b[sup t, sub i]D[sup -1]b[sub i]).
When Pi(D) is chosen to be an inverted Wishart, then (5) is also an inverted Wishart. However, when Pi(D) is chosen to be Pi[sub us](D) or Pi[sub j](D), then (5) is a nonstandard distribution that cannot be sampled from directly. Thus we adopt the Metropolis-Hastings algorithm (Gilks et al. 1996, p. 5) to generate realizations from (5) when using these priors. A natural choice of a candidate density, as suggested by, for instance, Chib and Greenberg (1995), is q(D) proportional to det(D)[sup -I/2] exp(-1/2 Sigma[sub i] b[sup t, sub i] D[sup -1]b[sub i]). Given a current value D, a new realization is generated by choosing D', sampled from q(.), with probability
A(D, D') = min{1, Pi(D')/Pi(D)},
or keeping D with probability 1 - A(D, D'). Two advantages of our choice of q(.) are that it is an inverted Wishart density, and hence is easily sampled from, and it reduces the calculation of the acceptance probability A to the simple form given earlier.
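A sketch of this Metropolis-Hastings step follows, assuming the candidate is the inverted Wishart with density proportional to det(D)[sup -I/2] exp(-1/2 Sigma[sub i] b[sup t, sub i]D[sup -1]b[sub i]) (degrees of freedom I - q - 1 and scale Sigma[sub i] b[sub i]b[sup t, sub i]), so that the likelihood cancels and only the prior ratio remains. The prior is passed in as a generic log-density; all names and numbers are ours:

```python
import numpy as np
from scipy.stats import invwishart

def mh_update_D(D, b, log_prior, rng):
    """One Metropolis-Hastings update of D given current random effects
    b (I x q).  The inverted Wishart candidate with scale sum b_i b_i'
    cancels the likelihood term det(D)^(-I/2) exp(-0.5 sum b_i' D^-1 b_i),
    so the acceptance probability reduces to min{1, pi(D')/pi(D)}."""
    I, q = b.shape
    S = b.T @ b                      # sum of b_i b_i'
    cand = np.atleast_2d(invwishart.rvs(df=I - q - 1, scale=S,
                                        random_state=rng))
    if np.log(rng.uniform()) < log_prior(cand) - log_prior(D):
        return cand
    return D

# Illustrative run with a flat (improper) log-prior on D, q = 2.
rng = np.random.default_rng(2)
b = rng.multivariate_normal([0, 0], [[1.0, 0.3], [0.3, 0.5]], size=50)
D = np.eye(2)
for _ in range(200):
    D = mh_update_D(D, b, lambda M: 0.0, rng)
```

In practice log_prior would be the log of Pi[sub us] or Pi[sub j]; the flat prior here merely exercises the update.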
This section describes simulations run to study the performances of the priors Pi(Beta, D) proportional to Pi[sub us] and Pi(Beta, D) proportional to Pi[sub j] relative to those of a common default Bayesian procedure and a plug-in empirical Bayes (EB) rule. The study used a logit-normal model and closely followed the design specifications of Zeger and Karim (1991), as described. The simulations were implemented using the matrix programming language GAUSS (Aptech Systems 1994).
6.1 Model
Conditionally independent Bernoulli responses y[sub ij] were generated within each of I is an element of {30, 50} clusters with mean logit(Mu[sup b, sub ij]) = Beta[sub 0] + Beta[sub 1]t[sub j] + Beta[sub 2]x[sub i] + Beta[sub 3]x[sub i]t[sub j] + b[sub i0] + b[sub i1]t[sub j], where x[sub i] = 1 for half the samples and 0 for the rest and t[sub j] = j - 4 for j = 1,..., 7. The true values of the fixed effects were taken to be Beta[sup t] = (-.625, .25, -.25, .125), and the cluster-specific random effects b[sub i] = (b[sub i0], b[sub i1])[sup t] were generated as a sequence of iid normal variates with mean 0 and variance matrix D[sub 1] = (1 0, 0 0) or D[sub 2] = (.50 0, 0 .25). Zeger and Karim in their analysis chose I = 100 and Beta[sup t] = (-2.5, 1, -1, .5); but because the sample sizes that we considered were much smaller than theirs, we took the true values of Beta to be smaller as well, to generate more informative datasets.
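One dataset under this design can be generated as follows (a sketch; the four coefficients are taken in the order intercept, time, group, and group-by-time interaction, and the seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
I = 30
t = np.arange(1, 8) - 4.0                  # t_j = j - 4, j = 1,...,7
x = np.repeat([0.0, 1.0], I // 2)          # group indicator, half and half
beta = np.array([-0.625, 0.25, -0.25, 0.125])
D = np.array([[0.50, 0.0], [0.0, 0.25]])   # the D_2 setting

y = np.empty((I, len(t)))
for i in range(I):
    b = rng.multivariate_normal([0.0, 0.0], D)   # random intercept and slope
    eta = (beta[0] + beta[1] * t + beta[2] * x[i]
           + beta[3] * x[i] * t + b[0] + b[1] * t)
    y[i] = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))
```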
6.2 Methods Being Compared
In addition to our priors, we considered two alternative approaches for making inferences about Beta, D, and the b[sub i]: a plug-in EB method and fully Bayesian methods using conjugate priors. The plug-in EB method obtains the MLEs of Beta and D from a numerical integration estimate of the marginal likelihood function
L(Beta, D) = Pi[sub i] Integral [Pi[sub j] f(y[sub ij]|Beta, b[sub i])] f(b[sub i]|D) db[sub i],
and uses them as true values in the posterior distribution of b[sub i]. When the estimate of D is not positive definite, we set the smaller variance, the covariance, and both their standard errors to 0, following Breslow and Clayton (1993).
For the Bayesian analysis, we used a uniform prior for Beta and a conjugate prior for D. When D = D[sub 1], the conjugate prior for Theta[sub 11] is an inverted gamma, which is equivalent to a gamma prior for Theta[sup -1, sub 11]. We chose the shape and scale of the gamma prior both equal to 1E-03, as this specification is widely used as a default prior (see, e.g., Spiegelhalter et al. 1996 and references therein). When D = D[sub 2], the conjugate prior for D[sup -1] is the Wishart with Rho degrees of freedom and scale matrix (Rho R)[sup -1]. It has been suggested that a good default specification is one for which Rho is taken to be as small as possible and R is taken to be a prior guess at D (e.g., Carlin and Louis 1996, p. 168). The rationale for this advice stems from the fact that E[D] approximately equal to R and var[D] is decreasing in Rho. Thus we let Rho = 2 and considered two choices for R: the identity matrix and D[sub 2]. The first choice represents a plausible choice for R in the absence of prior information about D (e.g., Gilks et al. 1996, p. 411; McCulloch and Rossi 1994), and the second is the optimal choice under the foregoing strategy. Note that because the true value of D is not available in practice, our choice of D[sub 2] represents an unattainable 'gold standard' for inference among diffuse conjugate priors.
6.3 Posterior Sampling and Criteria of Comparison
For each of the four combinations of cluster size and variance matrix, we generated 1,000 datasets. We checked the propriety of the posterior distribution for each dataset by verifying condition 2' of Result 1. For this model, condition 2' is equivalent to the existence of a cluster in each of the control (x = 0) and treatment (x = 1) groups for which an alternating sequence of successes and failures is observed at any three time points t[sub j] < t[sub k] < t[sub l]. All datasets generated were found to satisfy this weak condition.
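Condition 2' as described here is mechanical to check: a 0/1 sequence contains an alternating triple at increasing time points exactly when its value switches at least twice. A small sketch (the function names are ours):

```python
import numpy as np

def has_alternating_triple(seq):
    """A 0/1 sequence contains a subsequence 0,1,0 or 1,0,1 at increasing
    time points iff its value switches at least twice."""
    seq = np.asarray(seq)
    return int(np.sum(seq[1:] != seq[:-1])) >= 2

def condition_2prime(y, x):
    """y: (I x J) responses ordered in time; x: (I,) group indicator.
    Requires a qualifying cluster in each of the two groups."""
    return all(
        any(has_alternating_triple(y[i]) for i in np.where(x == g)[0])
        for g in (0, 1)
    )
```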
Inferences for the Bayesian methods were based on 2,000 samples generated from the full posterior distribution f(Beta, D, b[sub 1],..., b[sub I]|y) for each dataset. We performed the posterior sampling using the Gibbs sampler and followed the implementation described in Section 5. For the plug-in EB method, inferences about Beta and D were based on the MLE and their associated observed information matrix, whereas inferences about the random effects were based on 2,000 samples generated from f(b[sub 1],..., b[sub I]|y, Beta, D) using the Gibbs sampler.
The operating characteristics used to compare the estimators were the squared-error risk of the posterior mean (i.e., E[(Beta-hat - Beta)[sup t](Beta-hat - Beta)], E[tr((D-hat D[sup -1] - I)[sup 2])], Sigma E[(b-hat[sub i0] - b[sub i0])[sup 2]], and Sigma E[(b-hat[sub i1] - b[sub i1])[sup 2]], where hats denote the point estimates), noncoverage of the nominal 95% intervals from the 2.5th to 97.5th sample percentiles (scored as a binary outcome), and average interval width. For ease of reporting, the noncoverage and interval width of predictions of the random effects were averaged over the clusters.
6.4 Results
Tables 2, 3, and 4 display posterior results for D, b[sub i], and Beta. The relative results for the various methods are fairly consistent for I = 30 and I = 50, and hence we do not distinguish between these cases in the presentation of our results. We do note, however, an improvement in operating characteristics for the larger sample size (with a few exceptions). Examination of the tables leads to two general conclusions. First, the prior Pi[sub j] does not appear very competitive. It tends to have consistently wider intervals and poorer risk and has high noncoverage for D[sub 2]. Second, the prior Pi[sub us] has better overall performance than the diffuse conjugate priors and the EB method. Additionally, it behaves in a manner similar to that of the 'gold standard' Wishart prior for the D[sub 2] model, and appears to be slightly better for inferences about the variance components.
We begin with a discussion of the results for D = D[sub 1], and then move on to D = D[sub 2]. The prior Pi[sub us] appears to dominate the diffuse gamma prior in terms of better risk and coverage for D and b[sub i], and has a slight edge overall for inferences about the Beta. Our examination of the posterior samples and likelihood function for each dataset showed that the wider intervals and poorer coverage of the gamma prior were due to the combined influence of its large variance and infinite peak at 0. More specifically, we found that in most of the datasets, the marginal posterior of Theta[sub 11] from the gamma prior tended to have substantially heavier tails, thereby leading to wider intervals on average. But notable exceptions to this rule were datasets where the likelihood function of Theta[sub 11] had nontrivial mass near 0. In these cases, we found that the infinite peak of the gamma prior had a dramatic influence by skewing the posterior of Theta[sub 11] toward 0. This behavior accounts for the inflated noncoverage probabilities reported in Table 2.
The plug-in EB method produces smaller risks for Beta and Theta[sub 11] compared to Pi[sub us] but roughly the same risks for b[sub i0]. That the EB risks of b[sub i0] are competitive should not come as a surprise, as plug-in methods are known to produce reasonable predictors as long as the hyperparameters can be reliably estimated. [Similar results were found by Christiansen and Morris (1997) for I = 45 and n[sub i] = 3.] The better risk of Theta[sub 11] is due to the tendency of the MLE to be smaller than the posterior mean (based on inspection of the point estimates), coupled with the constraint Theta[sub 11] > 0 that bounds the difference (Theta-hat[sub 11]/Theta[sub 11] - 1) from below.
However, it is this tendency of the MLE that appears to lead to the high noncoverage of Theta[sub 11] reported in Table 2. We found that in the 97 datasets that failed to cover the true value when I = 50, the average Theta-hat[sub 11] was .34, compared with .98 across all 1,000 datasets. The EB intervals for the predictions also suffer from serious undercoverage, as seen in Table 3. The main reason for this behavior is the failure of such plug-in methods to account for the uncertainty of the hyperparameter estimates. Although corrections to inflate the prediction variance have been proposed, we did not investigate them, because they were too difficult to implement for the models considered here.
The problems with the EB method are exacerbated for the model using D[sub 2], as judged by the high noncoverage for all of the parameters and the comparatively poor risk for the predictions in Table 3. Note that the maximization algorithm failed to converge to a positive definite matrix in 20% of the cases when I = 30 and in 5% of the cases when I = 50, which partly explains this behavior. However, comparisons based only on those datasets where convergence to a positive definite matrix was achieved also revealed similar problems.
The results for the Wishart priors of model D[sub 2] illustrate sensitivity to the choice of R. The diffuse Wishart prior centered at the identity matrix behaves similarly to the gamma prior, in the sense that both its location and heavier tail prove influential for the designs considered here. This may be seen in the poor risk and dramatic noncoverage for D, but overly conservative intervals for b[sub i]. In contrast, the Wishart prior centered at D[sub 2] offers substantially better results. Of course, as indicated previously, this second Wishart prior is a 'gold standard' with respect to the conventional methods of default conjugate prior selection; hence it provides a standard that is unachievable in practice. In light of this, it is remarkable that the prior Pi[sub us] compares favorably with this 'gold standard' Wishart.
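This sensitivity to the centering matrix R can be illustrated numerically. A Wishart(Rho, R) prior has mean Rho R, so centering at the identity versus at the true covariance places the bulk of prior mass in very different regions. The sketch below (all numeric values, including the stand-in for D[sub 2], are our assumptions, not the paper's) draws from two such priors by summing outer products of MVN vectors, which is valid for integer degrees of freedom:

```python
import numpy as np

# Sketch of the sensitivity discussed above: a Wishart(rho, R) prior has
# prior mean rho*R, so centering at R = I versus R = D_2 concentrates the
# prior mass in very different regions.  Draws are formed as sums of outer
# products of N(0, R) vectors (valid for integer degrees of freedom).
rng = np.random.default_rng(4)

def wishart_draws(R, rho, size):
    """Return `size` draws from Wishart(rho, R) via sums of outer products."""
    L = np.linalg.cholesky(R)
    z = rng.normal(size=(size, rho, R.shape[0])) @ L.T   # rows ~ N(0, R)
    return np.einsum("srk,srl->skl", z, z)               # sum_r z_r z_r'

R_identity = np.eye(2)
D2 = np.array([[0.5, 0.1], [0.1, 0.25]])   # stand-in for the true D_2 (assumed)

for R in (R_identity, D2):
    draws = wishart_draws(R, rho=2, size=20_000)
    print(np.round(draws.mean(axis=0), 2))   # approximately 2*R
```

The empirical means near 2R make the point: unless R is chosen close to the true D (the unachievable 'gold standard'), the prior pulls the posterior toward the wrong region of the parameter space.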
Overall, our results show that for univariate D, the prior Pi[sub us] offers a substantial reduction in the risk and noncoverage of D compared to the conjugate prior. The improvement for b[sub i] is modest, and that for Beta is slight. However, when D is two-dimensional, the gains are more sizeable and impact inferences about D, b[sub i], and the Beta to a greater degree.
To illustrate our methods on longitudinal discrete responses, we revisit the data on seizure counts presented by Leppik et al. (1987) and subsequently analyzed by Diggle, Liang, and Zeger (1994, p. 188). The data come from a randomized double-blind trial designed to study the effect of progabide (PGB) in reducing epileptic seizures. The trial involved two centers, UMN and UVA, but we consider only the data from UMN and focus on comparing Bayesian estimates obtained using Pi[sub us] to those from a diffuse conjugate prior. The data consist of 30 epileptics who were randomized to either PGB (Trt = 1) or a placebo (Trt = 0) following an 8-week baseline period. A total of four postrandomization visits were made, and at each visit, counts of seizures y[sub ij] were recorded for the two previous weeks.
We assumed that the seizure counts y[sub ij] were conditionally (on b[sub i]) independent and drawn from a Poisson distribution with mean
log(Mu[sup b, sub ij]) = log(t[sub ij]) + x[sup t, sub ij]Beta + b[sub i0] + b[sub i1] Time + b[sub i2] Trt,
i = 1,..., 30, j = 0, 1,..., 4,
where t[sub ij] = 8 if j = 0 and t[sub ij] = 2 if j = 1,..., 4; b[sub i] = (b[sub i0], b[sub i1], b[sub i2])[sup t] is assumed to follow an MVN distribution with mean 0 and variance D (having components Theta[sub ij]); and x[sub ij] denotes the covariates: an intercept, Time (coded 0 if baseline and 1 otherwise), treatment group (Trt), and the interaction of Time with Trt (Time x Trt), which have coefficients Beta[sub 0], Beta[sub 1], Beta[sub 2], and Beta[sub 3]. The model is identical to model II of Diggle et al., except that we included an additional random effect b[sub i2] to allow for heterogeneity among subjects in their response to the drug.
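As a concrete (purely hypothetical) illustration of this specification, the sketch below simulates counts from the conditional Poisson model above. The parameter values, covariance D, and treatment assignment are our assumptions for illustration, not estimates from the trial:

```python
import numpy as np

# Illustrative simulation (not the paper's data) from the conditional model
#   log(mu_ij) = log(t_ij) + x_ij'beta + b_i0 + b_i1*Time_j + b_i2*Trt_i,
# with b_i ~ MVN(0, D).  All numeric values below are assumptions.
rng = np.random.default_rng(0)

I = 30                                   # subjects
beta = np.array([0.9, 0.2, 0.2, -0.5])   # intercept, Time, Trt, Time x Trt
D = np.diag([0.25, 0.20, 0.65])          # random-effect covariance (assumed)

t = np.array([8.0, 2.0, 2.0, 2.0, 2.0])  # observation periods in weeks, j = 0..4
time = np.array([0.0, 1.0, 1.0, 1.0, 1.0])

counts = np.zeros((I, 5))
trt = rng.integers(0, 2, size=I)         # randomized treatment indicator
for i in range(I):
    b = rng.multivariate_normal(np.zeros(3), D)
    for j in range(5):
        x = np.array([1.0, time[j], trt[i], time[j] * trt[i]])
        log_mu = np.log(t[j]) + x @ beta + b[0] + b[1] * time[j] + b[2] * trt[i]
        counts[i, j] = rng.poisson(np.exp(log_mu))

print(counts.shape)   # (30, 5): 30 subjects, baseline plus 4 visits
```

Note how the offset log(t[sub ij]) handles the unequal observation windows (8 baseline weeks versus 2-week postrandomization intervals).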
For the foregoing model, the treatment effect may be quantified by comparing either the conditional means Mu[sup b, sub ij] or the marginal means Mu[sub ij] between the PGB and placebo groups, where Mu[sub ij] takes the simple form exp(z[sup t, sub ij]Dz[sub ij]/2 + x[sup t, sub ij] Beta). The conditional calculation results in the subject-specific treatment effect Beta[sub 3], which represents the difference in the logarithm of the postrandomization to prerandomization conditional mean of seizure counts in the PGB and placebo groups. A similar calculation with the marginal means produces the population-averaged treatment effect Delta = Beta[sub 3] + Theta[sub 23]. Our reanalysis compares inferences about both these parameters using a conjugate prior and Pi[sub us]. A negative value for Beta[sub 3] or Delta corresponds to a greater reduction in seizure frequency for the PGB group.
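The closed form for Mu[sub ij] follows from the lognormal moment identity E[exp(z[sup t]b)] = exp(z[sup t]Dz/2) for b ~ MVN(0, D), since z[sup t]b is normal with variance z[sup t]Dz. A quick Monte Carlo check of that identity, under assumed values of D, x[sup t]Beta, and z:

```python
import numpy as np

# The marginal mean mu_ij = E_b[exp(x'beta + z'b)] has the closed form
# exp(z' D z / 2 + x'beta) because z'b ~ N(0, z'Dz).  All numeric values
# below are assumptions chosen for the check.
rng = np.random.default_rng(1)
D = np.array([[0.3, 0.05, 0.0],
              [0.05, 0.2, 0.1],
              [0.0, 0.1, 0.6]])
beta_term = 0.4                  # stands in for x'beta
z = np.array([1.0, 1.0, 1.0])

closed_form = np.exp(z @ D @ z / 2 + beta_term)
b = rng.multivariate_normal(np.zeros(3), D, size=400_000)
mc = np.mean(np.exp(beta_term + b @ z))
print(closed_form, mc)           # the two should agree closely
assert abs(mc - closed_form) / closed_form < 0.02
```

The same identity applied to the treatment contrast is what yields Delta = Beta[sub 3] + Theta[sub 23]: the variance term z[sup t]Dz/2 contributes the extra covariance component when the marginal means of the two groups are compared on the log scale.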
The condition in the corollary to Theorem 3 was verified and found to hold, thus assuring a proper posterior under Pi[sub us]. Samples of size 50,000 were generated from the posteriors based on Pi[sub us] and on a diffuse conjugate prior. The conjugate prior used here was the same as the one investigated in the simulation study of Section 6, namely, a Wishart prior with Rho = 3 and R equal to the identity matrix. The analysis based on the conjugate prior was performed using the program BUGS. (Because BUGS requires the use of proper priors, we specified normal priors with zero mean and a variance of 1E+04 for the fixed effects.)
Table 5 displays the posterior estimates and 95% intervals for the fixed-effects and variance parameters. Although many of the parameter estimates are in reasonable agreement under both priors, the 95% posterior intervals from the conjugate prior tend to be wider. In particular, conclusions about the subject-specific treatment effect Beta[sub 3] would be different using the two priors. This discrepancy is even more dramatic when one compares the posterior distributions of the population-averaged treatment effect Delta. The posterior probability that the treatment is ineffective (Pr(Delta >= 0)) is .0736 based on the conjugate prior and .0186 based on Pi[sub us].
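Given posterior draws of Beta[sub 3] and Theta[sub 23], the probability Pr(Delta >= 0) is simply the proportion of draws with Beta[sub 3] + Theta[sub 23] >= 0. A sketch with synthetic stand-in draws (normal shapes with assumed moments, not the actual MCMC chains behind Table 5):

```python
import numpy as np

# Sketch: compute Pr(Delta >= 0) from posterior draws, where
# Delta = Beta_3 + Theta_23.  The draws here are synthetic stand-ins
# with assumed moments, not the paper's chains.
rng = np.random.default_rng(2)
beta3_draws = rng.normal(-0.47, 0.20, size=50_000)    # assumed posterior shape
theta23_draws = rng.normal(0.04, 0.11, size=50_000)

delta = beta3_draws + theta23_draws
p_ineffective = np.mean(delta >= 0.0)
print(round(p_ineffective, 3))   # a small tail probability
```

The point of working with the draws themselves is that Delta is a nonlinear function of two correlated parameters; its posterior tail probability comes for free from the joint samples, with no extra asymptotic approximation.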
This example illustrates that inferences based on diffuse conjugate priors can be substantively different than those based on Pi[sub us]. Based on the better performance of the prior Pi[sub us] in the simulations, we would be inclined to accept the conclusions resulting from this prior over those produced by the conjugate prior.
The main findings of our article are that commonly used methods of choosing "default" priors for variance components can result in misleading inferences, and the approximate uniform shrinkage prior Pi[sub us] appears to offer a reasonable alternative. The method is easy to apply and leads to proper posterior distributions under mild conditions on the data. Our simulations for a logit-normal model showed that the posteriors based on Pi[sub us] provide better coverage and smaller risks than those based on diffuse conjugate priors, particularly for the variance components. The gains for the random-effects and fixed-effects parameters were modest and appeared noticeable only with an increase in the dimension of the random effects. Our real data example illustrated how inferences based on Pi[sub us] can lead to substantively different conclusions than those based on diffuse conjugate priors.
From the results presented here, the approximate uniform shrinkage prior appears quite promising for reference Bayesian data analysis with GLMMs. Furthermore, extensions to other models may be possible via a normal approximation at the first stage, as suggested in Section 3.1.
Table 1. Two Reference Priors Pi[sub us], Pi[sub j] (up to a Constant of Proportionality) for Normal, Bernoulli, and Poisson First-Stage Specifications With Mean h(Mu[sup b, sub ij]) = Beta + b[sub i], i = 1,..., I, j = 1,..., n[sub i], Sigma n[sub i] = n, Where h(.) Is the Canonical Link Function
First-stage distribution    Pi[sub us]                                                  Pi[sub j]
Normal                      (Phi + n Theta/I)[sup -2]                                   (Phi + n Theta/I)[sup -1]
Bernoulli                   (1 + n exp(Beta)Theta/[I(1 + exp(Beta))[sup 2]])[sup -2]    (1 + n exp(Beta)Theta/[I(1 + exp(Beta))[sup 2]])[sup -1]
Poisson                     (1 + n exp(Beta)Theta/I)[sup -2]                            (1 + n exp(Beta)Theta/I)[sup -1]
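The tabled kernels can be coded directly. A minimal sketch for the single-random-intercept case, where Pi[sub us] is the square of a shared base factor and Pi[sub j] is its first power (the function names are ours, and the densities are unnormalized):

```python
import math

# Unnormalized prior kernels from the table above (one random intercept,
# canonical links).  Pi_us is the square of the same base term that Pi_j
# takes to the first power.
def base_term(family, theta, n, I, beta=0.0, phi=1.0):
    """Shared factor: e.g. Normal -> phi + n*theta/I."""
    if family == "normal":
        return phi + n * theta / I
    if family == "bernoulli":
        e = math.exp(beta)
        return 1.0 + n * e * theta / (I * (1.0 + e) ** 2)
    if family == "poisson":
        return 1.0 + n * math.exp(beta) * theta / I
    raise ValueError(family)

def pi_us(family, theta, n, I, **kw):
    return base_term(family, theta, n, I, **kw) ** -2

def pi_j(family, theta, n, I, **kw):
    return base_term(family, theta, n, I, **kw) ** -1

# e.g. Poisson, theta = 1, n = 150, I = 30, beta = 0: base = 1 + 5 = 6
print(pi_us("poisson", theta=1.0, n=150, I=30, beta=0.0))   # 6^-2 = 1/36
```

Note how both kernels decay polynomially in Theta, in contrast to the gamma prior's infinite spike at 0 that drives the poor coverage discussed above.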
Table 2. Simulation Results for D: Risk, Noncoverage, and Interval Width

                                                Noncoverage                    Interval width
                                Risk            Th11    Th12    Th22           Th11    Th12    Th22
D = D[sub 1], I = 30
a. plug-in EB                   .35 +/- .02     .120                           2.06
b. G(1E-03, 1E-03)              .66 +/- .05     .102                           3.09
c. Pi[sub us]                   .42 +/- .04     .053                           2.77
d. Pi[sub j]                    .68 +/- .06     .066                           3.28
D = D[sub 1], I = 50
a. plug-in EB                   .20 +/- .01     .097                           1.69
b. G(1E-03, 1E-03)              .29 +/- .01     .084                           2.05
c. Pi[sub us]                   .22 +/- .01     .048                           1.93
d. Pi[sub j]                    .29 +/- .02     .059                           2.12
D = D[sub 2], I = 30
a. plug-in EB                  2.14 +/- .13     .220    .210    .190           1.23     .63     .58
b. W(2, (2I)[sup -1])          7.65 +/- .35     .130    .010    .440           2.56    1.33    1.23
c. W(2, (2D[sub 2])[sup -1])   3.62 +/- .20     .039    .018    .067           2.12    1.10     .95
d. Pi[sub us]                  3.10 +/- .19     .035    .029    .041           2.12    1.05     .88
e. Pi[sub j]                  10.32 +/- .02     .140    .043    .150           3.13    1.56    1.32
D = D[sub 2], I = 50
a. plug-in EB                  1.24 +/- .05     .238    .134    .181            .89     .48     .36
b. W(2, (2I)[sup -1])          3.01 +/- .12     .112    .011    .312           1.58     .81     .63
c. W(2, (2D[sub 2])[sup -1])   1.51 +/- .09     .038    .035    .055           1.37     .70     .47
d. Pi[sub us]                  1.39 +/- .07     .043    .038    .055           1.39     .86     .45
e. Pi[sub j]                   3.04 +/- .17     .093    .061    .118           1.75     .86     .58
Table 3. Simulation Results for the Random-Effects Predictions: Risk, Noncoverage, and Interval Width

                                                                 Noncoverage       Interval width
                                Risk b[sub 0]   Risk b[sub 1]    b0      b1        b0      b1
D = D[sub 1], I = 30
a. plug-in EB                  16.20 +/- .15                     .115              2.48
b. G(1E-03, 1E-03)             17.06 +/- .18                     .072              2.92
c. Pi[sub us]                  16.17 +/- .16                     .057              2.94
d. Pi[sub j]                   16.72 +/- .18                     .050              3.09
D = D[sub 1], I = 50
a. plug-in EB                  25.00 +/- .18                     .082              2.55
b. G(1E-03, 1E-03)             25.63 +/- .19                     .063              2.78
c. Pi[sub us]                  25.01 +/- .18                     .056              2.79
d. Pi[sub j]                   25.30 +/- .18                     .051              2.87
D = D[sub 2], I = 30
a. plug-in EB                  12.11 +/- .12    4.86 +/- .06     .254    .196      1.78    1.21
b. W(2, (2I)[sup -1])          12.55 +/- .15    5.52 +/- .07     .025    .024      3.05    1.98
c. W(2, (2D[sub 2])[sup -1])   11.47 +/- .13    4.61 +/- .05     .036    .036      2.74    1.73
d. Pi[sub us]                  11.51 +/- .12    4.51 +/- .05     .045    .048      2.67    1.63
e. Pi[sub j]                   13.56 +/- .21    5.32 +/- .09     .034    .035      3.05    1.87
D = D[sub 2], I = 50
a. plug-in EB                  19.12 +/- .14    7.20 +/- .06     .200    .110      1.86    1.29
b. W(2, (2I)[sup -1])          18.60 +/- .16    7.74 +/- .07     .032    .032      2.73    1.76
c. W(2, (2D[sub 2])[sup -1])   17.80 +/- .14    7.00 +/- .06     .046    .044      2.48    1.57
d. Pi[sub us]                  18.05 +/- .13    6.98 +/- .05     .058    .054      2.19    1.50
e. Pi[sub j]                   19.13 +/- .17    7.42 +/- .07     .046    .043      2.63    1.63

NOTE: The noncoverage rates are accurate to +/- .001 for 5% noncoverage.
Table 4. Simulation Results for the Fixed Effects Beta: Risk, Noncoverage, and Interval Width

                                              Noncoverage                      Interval width
                                Risk          B0     B1     B2     B3          B0     B1     B2     B3
D = D[sub 1], I = 30
a. plug-in EB                   .45 +/- .01   .070   .053   .064   .055        1.34   .46    1.92   .68
b. G(1E-03, 1E-03)              .50 +/- .02   .052   .058   .051   .062        1.58   .49    2.31   .72
c. Pi[sub us]                   .49 +/- .02   .051   .056   .042   .056        1.56   .48    2.28   .72
d. Pi[sub j]                    .53 +/- .02   .046   .060   .044   .061        1.65   .49    2.41   .73
D = D[sub 1], I = 50
a. plug-in EB                   .25 +/- .01   .049   .049   .054   .054        1.05   .36    1.49   .52
b. G(1E-03, 1E-03)              .26 +/- .01   .054   .057   .049   .060        1.14   .36    1.65   .54
c. Pi[sub us]                   .26 +/- .01   .043   .058   .046   .057        1.13   .36    1.64   .53
d. Pi[sub j]                    .29 +/- .01   .047   .061   .048   .061        1.17   .37    1.70   .54
D = D[sub 2], I = 30
a. plug-in EB                   .42 +/- .01   .058   .080   .068   .081        1.17   .71    1.66   1.01
b. W(2, (2I)[sup -1])           .58 +/- .02   .033   .050   .040   .038        1.61   1.00   2.35   1.44
c. W(2, (2D[sub 2])[sup -1])    .49 +/- .02   .031   .066   .036   .049        1.48   .88    2.16   1.25
d. Pi[sub us]                   .46 +/- .02   .033   .058   .044   .045        1.44   .83    2.12   1.19
e. Pi[sub j]                    .56 +/- .02   .035   .055   .046   .052        1.63   .97    2.37   1.37
D = D[sub 2], I = 50
a. plug-in EB                   .23 +/- .01   .065   .064   .070   .061         .88   .54    1.24    .76
b. W(2, (2I)[sup -1])           .30 +/- .01   .047   .048   .031   .045        1.12   .69    1.61    .99
c. W(2, (2D[sub 2])[sup -1])    .26 +/- .01   .044   .069   .041   .051        1.04   .61    1.50    .87
d. Pi[sub us]                   .25 +/- .01   .037   .047   .036   .057        1.02   .59    1.47    .85
e. Pi[sub j]                    .28 +/- .01   .032   .060   .036   .054        1.10   .65    1.59    .93

NOTE: The noncoverage rates are accurate to +/- .007 for 5% noncoverage.
Table 5. Posterior Estimates (95% Intervals) for the Seizure Data

Variable                          BUGS                 Pi[sub us]
Constant (Beta[sub 0])            .92 (.39, 1.42)      .86 (.57, 1.16)
Time (Beta[sub 1])                .22 (-.19, .63)      .22 (-.09, .52)
Trt (Beta[sub 2])                 .15 (-.51, .84)      .17 (-.26, .62)
Time x Trt (Beta[sub 3])         -.46 (-.99, .06)     -.47 (-.87, -.09)
Square root of Theta[sub 11]      .89 (.59, 1.28)      .49 (.33, .71)
Theta[sub 12]                    -.08 (-.40, .18)     -.07 (-.26, .09)
Theta[sub 13]                    -.37 (-1.19, .10)    -.09 (-.35, .15)
Square root of Theta[sub 22]      .59 (.43, .79)       .46 (.30, .68)
Theta[sub 23]                     .02 (-.28, .34)      .04 (-.16, .26)
Square root of Theta[sub 33]      .90 (.52, 1.42)      .82 (.43, 1.36)
Delta                            -.40 (-.93, .14)     -.43 (-.85, -.02)
Albert, A., and Anderson, J. A. (1984), "On the Existence of Maximum Likelihood Estimates in Logistic Regression," Biometrika, 71, 1-10.
Aptech Systems (1994), The GAUSS System Version 3.2.23, Maple Valley, WA: Author.
Breslow, N. E., and Clayton, D. G. (1993), "Approximate Inference in Generalized Linear Mixed Models," Journal of the American Statistical Association, 88, 9-25.
Carlin, B. P., and Louis, T. A. (1996), Bayes and Empirical Bayes Methods for Data Analysis, London: Chapman and Hall.
Chib, S., and Greenberg, E. (1995), "Understanding the Metropolis-Hastings Algorithm," The American Statistician, 49, 327-335.
Christiansen, C. L., and Morris, C. N. (1997), "Hierarchical Poisson Regression Modeling," Journal of the American Statistical Association, 92, 618-632.
Daniels, M. J. (1999), "A Prior for the Variance in Hierarchical Models," The Canadian Journal of Statistics, 27, 569-580.
Diggle, P. J., Liang, K-Y., and Zeger, S. L. (1994), Longitudinal Data Analysis, New York: Oxford University Press.
Gelfand, A. E., Hills, S. E., Racine-Poon, A., and Smith, A. F. M. (1990), "Illustration of Bayesian Inference in Normal Data Models Using Gibbs Sampling," Journal of the American Statistical Association, 85, 972-985.
Gilks, W. R., Richardson, S., and Spiegelhalter, D. J. (1996), Markov Chain Monte Carlo in Practice, London: Chapman and Hall.
Hobert, J. P., and Casella, G. (1996), "The Effect of Improper Priors on Gibbs Sampling in Hierarchical Linear Mixed Models," Journal of the American Statistical Association, 91, 1461-1473.
Kass, R. E., and Wasserman, L. (1996), "Selection of Prior Distributions by Formal Rules," Journal of the American Statistical Association, 91, 1343-1370.
Leppik, I. E., Dreifuss, F. E., Porter, R., Bowman, T., Santilli, N., Jacobs, M., Crosby, C., Cloyd, J., Stackman, J., Graves, N., Sutula, T., Welty, T., Vickery, J., Brundage, R., Gates, J., Gumnit, R. J., and Gutierrez, A. (1987), "A Controlled Study of Progabide in Partial Seizures," Neurology, 37, 963-968.
MathSoft, Inc. (1996), S-PLUS Version 3.4, StatSci Division, Seattle, WA: Author.
McCullagh, P., and Nelder, J. A. (1989), Generalized Linear Models (2nd ed.), London: Chapman and Hall.
McCulloch, R., and Rossi, P. E. (1994), "An Exact Likelihood Analysis of the Multinomial Probit Model," Journal of Econometrics, 64, 207-240.
Natarajan, R., and McCulloch, C. E. (1995), "A Note on the Existence of the Posterior Distribution for a Class of Mixed Models for Binomial Responses," Biometrika, 82, 639-643.
----- (1998), "Gibbs Sampling With Diffuse Proper Priors: A Valid Approach to Data-Driven Inference?," Journal of Computational and Graphical Statistics, 7, 267-277.
Natarajan, R., and Kass, R. E. (1999), "A Default Conjugate Prior for Variance Components in Generalized Linear Mixed Models," unpublished manuscript.
Santner, T. J., and Duffy, D. E. (1986), "A Note on A. Albert and J. Anderson's Conditions for the Existence of Maximum Likelihood Estimates in Logistic Regression Models," Biometrika, 73, 755-758.
SAS Institute, Inc. (1997), SAS/STAT Software: Changes and Enhancements Through Release 6.12, Cary, NC: Author.
Spiegelhalter, D. J., Thomas, A., Best, N. G., and Gilks, W. R. (1996), BUGS: Bayesian Inference Using Gibbs Sampling, Version 0.50, Cambridge, U.K.: MRC Biostatistics Unit.
Strawderman, W. E. (1971), "Proper Bayes Estimators of the Multivariate Normal Mean," The Annals of Mathematical Statistics, 42, 385-388.
Tiao, G. C., and Tan, W. Y. (1965), "Bayesian Analysis of Random-Effects Models in the Analysis of Variance. I. Posterior Distribution of Variance Components," Biometrika, 52, 37-53.
Wang, C. S., Rutledge, J. J., and Gianola, D. (1994), "Bayesian Analysis of Mixed Linear Models via Gibbs Sampling With an Application to Litter Size of Iberian Pigs," Genetique, Selection, Evolution, 26, 1-25.
Zeger, S. L., and Karim, M. R. (1991), "Generalized Linear Models With Random Effects: A Gibbs Sampling Approach," Journal of the American Statistical Association, 86, 79-86.
Justification of the Approximate Information Matrix
Assuming that V(b[sub i]) = var[b[sub i]|y[sub i], Beta, D] does not depend on y[sub i] (as in the normal mixed model), we can reduce the first term in (3) to Sigma[sub i] tr(A[sub j]V(b[sub i])) Sigma[sub i'] tr(A[sub k]V(b[sub i'])) + Sigma[sub i] tr(A[sub j]V(b[sub i])) Sigma[sub i'] E[E[sup t](b[sub i'])A[sub k]E(b[sub i'])] + Sigma[sub i] tr(A[sub k]V(b[sub i])) Sigma[sub i'] E[E[sup t](b[sub i'])A[sub j]E(b[sub i'])] + Sigma[sub i] Sigma[sub i'] E[E[sup t](b[sub i])A[sub j]E(b[sub i])E[sup t](b[sub i'])A[sub k]E(b[sub i'])], where E(b[sub i]) = E[b[sub i]|y[sub i], Beta, D]. A common approximation strategy is to replace the random-effect moments with the mode and curvature of f(b[sub i]|y[sub i], Beta, D). More specifically, we replace E(b[sub i]) with the mode b[sub i], and V(b[sub i]) with the corresponding curvature estimate C[sub i] = (D[sup -1] + Z[sup t, sub i]W[sub i]Z[sub i])[sup -1]. The expectation E[b[sup t, sub i]A[sub j]b[sub i]] can then be calculated using standard results for quadratic forms, and requires the first two marginal moments of y[sub i]. Approximations to these may be obtained from a Taylor series expansion of the link function around b[sub i] = 0 to give E[y[sub i]] approximately equal to Mu[sup 0, sub i] and var[y[sub i]] approximately equal to V[sub i] + Delta[sup -1, sub i]Z[sub i]DZ[sup t, sub i]Delta[sup -1, sub i], where V[sub i] (n[sub i] x n[sub i]) and Delta[sub i] (n[sub i] x n[sub i]) are diagonal matrices with elements v[sup 0, sub ij] and {differential Eta[sup 0, sub ij]/differential Mu[sup 0, sub ij]}. The cross-product term E[b[sup t, sub i]A[sub j]b[sub i]b[sup t, sub i']A[sub k]b[sub i']] is approximated using normal theory results. These approximations lead to the expression in (4).
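The curvature step above is a single matrix computation. A sketch for assumed small matrices (the GLM iterative weights in W[sub i] are hypothetical values, chosen only to make the example concrete):

```python
import numpy as np

# Sketch of the curvature approximation described above:
#   V(b_i) ~ C_i = (D^{-1} + Z_i' W_i Z_i)^{-1},
# with assumed small matrices and hypothetical diagonal GLM weights W_i.
D = np.array([[0.5, 0.1],
              [0.1, 0.3]])           # random-effects covariance (assumed)
Z_i = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [1.0, 2.0]])         # random-effects design for cluster i
W_i = np.diag([1.2, 0.8, 1.5])       # hypothetical iterative weights

C_i = np.linalg.inv(np.linalg.inv(D) + Z_i.T @ W_i @ Z_i)
print(C_i.shape)   # (2, 2), symmetric positive definite
```

Because D[sup -1] and Z[sup t, sub i]W[sub i]Z[sub i] are both positive (semi)definite, C[sub i] is guaranteed positive definite, which is what makes it a usable stand-in for the posterior covariance of b[sub i].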
Proof of Theorem 1
The posterior distribution of Beta and D is proper iff the marginal distribution of the data
(A.1) [Multiple line equation(s) cannot be represented in ASCII text]
is integrable. Integrating the right side of (A.1) over D and using Hadamard's inequality [det(Sigma[sup I, sub i=1]b[sub i]b[sup t, sub i]) <= Pi[sup q, sub k=1](Sigma[sup I, sub i=1]b[sup 2, sub ik]), where b[sub ik] denotes the k-th component of b[sub i]] gives
[Multiple line equation(s) cannot be represented in ASCII text].
We make the transformation u[sub 1] = b[sub 11] and u[sub i] = b[sub i1]/b[sub 11] for i = 2, ..., I. The Jacobian of this transformation is u[sup I-1, sub 1], leading to
[Multiple line equation(s) cannot be represented in ASCII text],
where b[sup *t, sub 1] = {u[sub 1], b[sub 12], ..., b[sub 1q]} and b[sup t, sub i] = {u[sub 1]u[sub i], b[sub i2], ..., b[sub iq]} for i = 2, ..., I. It is clear that the integral over u[sub 1] diverges, because 1/u[sub 1] is nonintegrable in a neighborhood of 0.
Proof of Theorem 2
We prove that a uniform prior on the space of positive definite matrices of the form S [i.e., with all principal minors (pm) less than unity] is integrable. More specifically, we show that ∫[sub R] dS < infinity, where R = {S: all pm are positive and less than unity}. Because R is a subset of R[sub 1] = {S: pm of order 1 and 2 are positive and less than unity}, ∫[sub R] dS can be bounded above by [multiple-line expression that cannot be represented in ASCII text], where I(.) is the indicator function. Integrating over the s[sub ij] gives [multiple-line expression that cannot be represented in ASCII text], which is finite.
Proof of Theorem 3
The proof rests on bounding m(y) from above by Epsilon. We first examine
[Multiple line equation(s) cannot be represented in ASCII text].
Make the transformation r[sub k] = x[sup t, sub k]Beta for any p full-rank design vectors x[sup t, sub k] (which are assumed to exist because rank(X) = p). The Jacobian of this transformation is det(X[sup -1, sub *]), where X[sub *] (p x p) has rows x[sup t, sub k]. Then, ignoring the indices k that are not in X[sub *] (because the individual components in the likelihood are bounded, these indices can be ignored), we see that m(y) is bounded above by an expression proportional to Epsilon.
Proof of Corollary to Theorem 3
For the gamma family with canonical link Mu[sub k] = -1/(x[sup t, sub k]Beta + z[sup t, sub k]b) we can write
f(y[sub k]|r[sub k], b) proportional to exp(Phi[sup -1]y[sub k](r[sub k] + z[sup t, sub k]b)){-(r[sub k] + z[sup t, sub k]b)}[sup 1/Phi]
if r[sub k] + z[sup t, sub k]b < 0, and 0 otherwise. Thus
[Multiple line equation(s) cannot be represented in ASCII text].
On making the transformation u[sub k] = -(r[sub k] + z[sup t, sub k]b), we obtain
[Multiple line equation(s) cannot be represented in ASCII text],
which is finite. For the log link ln(Mu[sub k]) = x[sup t, sub k]Beta + z[sup t]b, a similar proof follows after writing f(y[sub k]|r[sub k], b) proportional to exp(-Phi[sup -1](y[sub k] exp(-r[sub k] - z[sup t, sub k]b) - r[sub k] - z[sup t, sub k]b)) and using the transformation u[sub k] = exp(-r[sub k] - z[sup t, sub k]b).
For the inverse Gaussian family with canonical link Mu[sup 2, sub k] = -1/(2(x[sup t, sub k]Beta + z[sup t, sub k]b)), we can write f(y[sub k]|r[sub k], b) proportional to exp(y[sub k](r[sub k] + z[sup t, sub k]b) + square root of -2(r[sub k] + z[sup t, sub k]b)/Phi) if r[sub k] + z[sup t, sub k]b < 0 and 0 otherwise. Thus
[Multiple line equation(s) cannot be represented in ASCII text].
Making the transformation u[sub k] = -(r[sub k] + z[sup t, sub k]b), we obtain
[Multiple line equation(s) cannot be represented in ASCII text],
which is finite.
For the Poisson family with canonical link ln(Mu[sub k]) = x[sup t, sub k]Beta + z[sup t, sub k]b, we can write
f(y[sub k]|r[sub k], b) proportional to exp(y[sub k](r[sub k] + z[sup t, sub k]b) - exp(r[sub k] + z[sup t, sub k]b)).
Then
[Multiple line equation(s) cannot be represented in ASCII text].
Making the transformation Mu[sub k] = exp(r[sub k] + z[sup t, sub k]b), we obtain
[Multiple line equation(s) cannot be represented in ASCII text],
which is finite if the y[sub k] are all nonzero.
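To make the finiteness condition concrete: after the transformation Mu[sub k] = exp(r[sub k] + z[sup t, sub k]b), and suppressing any dispersion constants for brevity, each Poisson factor reduces to a gamma integral (a sketch of the standard step):

```latex
\int_{0}^{\infty} e^{\,y_k \log \mu_k - \mu_k}\,\frac{d\mu_k}{\mu_k}
  \;=\; \int_{0}^{\infty} \mu_k^{\,y_k - 1} e^{-\mu_k}\, d\mu_k
  \;=\; \Gamma(y_k),
```

which is finite whenever y[sub k] >= 1 but diverges when y[sub k] = 0, because the integrand then behaves like 1/Mu[sub k] near the origin. This is precisely why the result requires the y[sub k] to be nonzero.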
For the normal family with canonical link Mu[sub k] = x[sup t, sub k]Beta + z[sup t, sub k]b, we can write f(y[sub k]|r[sub k], b) proportional to exp(-(1/2Phi)(y[sub k] - r[sub k] - z[sup t, sub k]b)[sup 2]). Thus
[Multiple line equation(s) cannot be represented in ASCII text]
which is readily seen to be finite.
Proof of Theorem 4
Necessary. Suppose that there exist Beta such that X[sup *]Beta < 0. Denote this set of Beta by B. It is clear that B is a convex cone in R[sup p] and hence has infinite Lebesgue measure. Thus we can write
(A.2) [Multiple line equation(s) cannot be represented in ASCII text],
where (A.2) follows using the definition of B and the integrand is positive and monotonic. It is easy to see that the right side of (A.2) diverges due to the integral over Beta. A proof similar to the one outlined by Albert and Anderson (1984) follows when the only Beta that violate the necessary condition are the ones for which the equality is satisfied for some row.
Sufficient. We first note that
[Multiple line equation(s) cannot be represented in ASCII text]
serves as an upper bound for m(y), where i denotes the cluster for which the sufficient condition is satisfied. This follows on (a) making the transformation w[sub j] = Beta[sub j] + b[sub ij] (j = 1, ..., q), w[sub j] = Beta[sub j] (j = q + 1, ..., p), and (b) bounding the integral over the b[sub i] and D by M. By condition 2, for every Beta there is a j for which x[sup *[sup t], sub ij]Beta > 0. This implies that for every Beta there is a j such that Theta(x[sup *, sub ij], Beta) < Pi/2, where Theta(.,.) denotes the angle between the two vectors. By a simple application of the Heine-Borel theorem, it can be shown that for every Beta there is a j such that Theta(x[sup *, sub ij], Beta) < Alpha, where Alpha < Pi/2. Define A[sub ij] = {Beta: Theta(x[sup *, sub ij], Beta) < Alpha}. Then
[Multiple line equation(s) cannot be represented in ASCII text],
where we have used the boundedness of the likelihood to retain only the jth row of the ith cluster. It suffices to inspect the integrability of one of the integrals in the foregoing summation, call it T[sub ij]. Because for any Beta in A[sub ij], x[sup *[sup t], sub ij]Beta > c[sub i](Beta[sup t]Beta)[sup 1/2], where c[sub i] = (x[sup *[sup t], sub i] x[sup *, sub i])[sup 1/2] cos(Alpha), we have
[Multiple line equation(s) cannot be represented in ASCII text].
On switching to polar coordinates, the integral T[sub ij] is of the form ∫ (-v)[sup p] dg(v) when y[sub ij] = 1 and ∫ v[sup p] dg(v) when y[sub ij] = 0. This is finite by condition 3.
Proof of Corollary to Theorem 4
The integral ∫ v[sup p] dg(v) is readily seen to be finite when g(.) is the logit or probit link function. Further, Z may be bounded above by 1 when Pi(D) is proper.
Proof of Result 1
The likelihood function may be separated into the product of contributions from each experimental group. On applying the arguments outlined in the proof of the sufficient condition of Theorem 4 to each group, it may be shown that condition 2' is sufficient for integrability.
[Received July 1998. Revised July 1999.]
~~~~~~~~
By Ranjini Natarajan and Robert E. Kass
Ranjini Natarajan is Assistant Professor, Division of Biostatistics, Department of Statistics, University of Florida, Gainesville, FL 32611 (E-mail: ranjini@stat.ufl.edu). Robert E. Kass is Professor, Department of Statistics, Carnegie-Mellon University, Pittsburgh, PA 15213 (E-mail: kass@stat.cmu.edu).
Title: | Markov Chain Monte Carlo (Book Review). |
Subject(s): | |
Source: | |
Author(s): | |
Abstract: | Reviews the book `Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference,' by Dani Gamerman. |
AN: | 3032177 |
ISSN: | 0162-1459 |
Database: | Business Source Premier |
Dani GAMERMAN. London: Chapman and Hall, 1997. ISBN 0-412-81820-5. xiii + 245 pp. $44.95 (P).
This book presents a concise account of the computational tools and software that facilitate contemporary Bayesian inference. It is based on a short course taught by the author and claims to be largely self-contained. This is true in the sense that most of the key background material is covered in limited detail. Of course, at 245 pages, the book is not exhaustive in its coverage of the field. The first two chapters introduce stochastic simulation and the main concepts involved in the Bayesian approach to inference. The remainder of the book concentrates on various available methods that facilitate Bayesian computations. In particular, Chapter 3 gives a broad overview of relevant methods, covering asymptotic approximation, numerical quadrature, and simulation-based methods. Chapters 4-6 significantly expand on the latter topic, first reviewing some basic theory of Markov chains and then moving on to Gibbs sampling and Metropolis-Hastings algorithms. Aside from basic tenets of these iterative simulation methods, topics covered in these chapters include convergence diagnostics, benefits of reparameterization, and hybrid algorithms. The book concludes with a smattering of topics related to model selection and model adequacy, two topics of active research.
A nice feature is the book's coverage of relatively recent developments in computational Bayesian inference, evident in the rather significant number of recent references provided. The entire book is written at an accessible mathematical level. Although the author claims that a course based on this book could be offered after a single course in probability, my own recommendation would be that students have some prior exposure to both probability and inference and some basic computing skills. Each chapter contains a significant number of problems (from 9 to 19) requiring a moderate range of ability and mathematical sophistication.
With the recent flurry of good books on Bayesian methods and computation (many from the same publishing house), a reasonable question is whether this book really carves out a unique niche. The answer seems unclear. The recent books of Carlin and Louis (1996) and Gelman, Carlin, Stern, and Rubin (1995) are oriented more toward the practicing statistician and contain a wealth of real data examples, a feature not found in this book. Both also cover the main tools of Bayesian computation (e.g., asymptotic approximation, quadrature, and simulation), though perhaps with a slightly less contemporary flair. Consequently, Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference would not be my first choice as a textbook for teaching the basic principles of Bayesian inference and data analysis. However, it might prove to be a better choice as a text for a "how to"-oriented course on Bayesian computational methods. In this regard, Gamerman's nicely focused, elementary-level coverage of the relevant material makes this book a more suitable choice for an introductory course than, say, the text of Gilks, Richardson, and Spiegelhalter (1996). However, it is not clear that this book will be as valuable as that text to statisticians presently working in this field.
In short, this book may have some difficulty finding its own niche among statisticians. However, it is certainly worth a look if you have been searching for a focused introduction to contemporary developments in Bayesian computation.
Carlin, B., and Louis, T. (1996), Bayes and Empirical Bayes Methods for Data Analysis, London: Chapman and Hall.
Gelman, A., Carlin, J. B., Stern, H., and Rubin, D. B. (1995), Bayesian Data Analysis, London: Chapman and Hall.
Gilks, W. R., Richardson, S., and Spiegelhalter, D. J. (1996), Markov Chain Monte Carlo in Practice, London: Chapman and Hall.
~~~~~~~~
By Robert L. Strawderman, University of Michigan-Ann Arbor
Title: | Fitting Mixed-Effects Models Using Efficient EM-Type Algorithms. |
Subject(s): | |
Source: | |
Author(s): | |
Abstract: | In recent years numerous advances in EM methodology have led to algorithms which can be very efficient when compared with both their EM predecessors and other numerical methods (e.g., algorithms based on Newton-Raphson). This article combines several of these new methods to develop a set of mode-finding algorithms for the popular mixed-effects model which are both fast and more reliable than such standard algorithms as proc mixed in SAS. We present efficient algorithms for maximum likelihood (ML), restricted maximum likelihood (REML), and computing posterior modes with conjugate proper and improper priors. These algorithms are not only useful in their own right, but also illustrate how parameter expansion, conditional data augmentation, and the ECME algorithm can be used in conjunction to form efficient algorithms. In particular, we illustrate a difficulty in using the typically very efficient PXEM (parameter-expanded EM) for posterior calculations, but show how algorithms based on conditional data augmentation can be used. Finally, we present a result that extends Hobert and Casella's result on the propriety of the posterior for the mixed-effects model under an improper prior, an important concern in Bayesian analysis involving these models that when not properly understood has led to difficulties in several applications. [ABSTRACT FROM AUTHOR] |
AN: | 3015915 |
ISSN: | 1061-8600 |
Database: | Business Source Premier |
Key Words: EM algorithm; ECME algorithm; Gaussian hierarchical models; Posterior inference; PXEM algorithm; Random-effects models; REML; Variance-component models; Working parameters.
The EM algorithm (Dempster, Laird, and Rubin 1977) has long been a popular tool for statistical analysis in the presence of missing data or in problems that can be formulated as such. Fitting mixed-effects models is among the most important uses of the EM algorithm as illustrated by the great variety and number of applications (see, e.g., Meng and Pedlow 1992; Meng and van Dyk 1997) and its development in the statistical literature (e.g., Laird 1982; Laird and Ware 1982; Dempster, Selwyn, Patel, and Roth 1984; Laird, Lange, and Stram 1987; Liu and Rubin 1994; and Meng and van Dyk 1998). There is no doubt that the reason for EM's popularity compared with other numerical methods (e.g., Newton-Raphson, as developed by Thompson and Meyer 1986; Lindstrom and Bates 1988; and Callahan and Harville 1991; see also Harville 1977) which can be much faster than the early EM implementations is EM's superior stability properties (e.g., monotone convergence in log-likelihood or log posterior). For example, without extra computational effort, such numerical methods can converge to negative variance estimates (e.g., Thompson and Meyer 1986; Callahan and Harville 1991). Even implementations released in standard software which incorporate special monitoring can converge to a point outside the parameter space (e.g., SAS; see Section 4) or to the wrong point within the parameter space (e.g., S-Plus; see Meng and van Dyk 1998 and Section 4). The primary goal of this article is to build algorithms that are very fast but maintain the stability properties of EM-type algorithms.
This goal maintains the spirit of several recent advances in EM methodology. For example, Meng and van Dyk (1998) (see also Foulley and Quaas 1995) developed an alternative EM-type implementation (i.e., an ECME algorithm) for the mixed-effects model that substantially reduced the computational effort for obtaining maximum likelihood (ML) and restricted maximum likelihood (REML) estimates compared with earlier implementations (e.g., Laird, Lange, and Stram 1987). This adaptation was further improved by Liu, Rubin, and Wu (1998) in the special case of ML estimation with univariate response (i.e., a type of regression with heterogeneous residual variance) using the PXEM algorithm. In this article, we show how PXEM can be used for ML and REML model fitting of much more general mixed-effects models. We also show how ECME methodology can be used to eliminate data augmentation for the residual variance parameter in addition to the fixed-effects parameters while maintaining an algorithm which is completely in closed form. This extends Liu and Rubin's (1995) ECME algorithm which starts with the less efficient algorithm of Laird, Lange, and Stram (1987) and requires nested iterations for the ECME update of the residual variance.
This article is organized into four additional sections. In Section 2, after a brief review of EM, ECME (Liu and Rubin 1995), and working parameters (Meng and van Dyk 1997), we extend parameter-expanded EM (PXEM; Liu, Rubin, and Wu 1998) to compute posterior modes and illustrate why this efficient algorithm can be difficult to use in this setting. Section 3 uses the methods developed and reviewed in Section 2 to construct several new algorithms for fitting variations of the mixed-effects model. Section 4 illustrates briefly the computational speed and stability of the algorithms relative to commercially available software. Finally, Section 5 contains concluding remarks and an Appendix proves a result on the propriety of the posterior distributions when using certain improper priors.
2.1 BACKGROUND ON EM-TYPE ALGORITHMS
The EM algorithm is designed to compute a (local) mode, Theta[sup *], of l(Theta|Y[sub obs]) = log p(Y[sub obs]|Theta) + log p(Theta), where the parameter Theta is allowed to vary over some space Theta and Y[sub obs] is the observed data. For likelihood calculations, log p(Theta) = 0 for all Theta is an element of Theta and l(Theta|Y[sub obs]) is the log-likelihood; for Bayesian calculations l(Theta|Y[sub obs]) refers to the log posterior. Throughout this article, the notation l is used for a log posterior or a log-likelihood.
A data-augmentation scheme, p(Y[sub aug]|Theta), is a model defined so that
(2.1) $p(Y_{\mathrm{obs}} \mid \theta) = \int_{\{Y_{\mathrm{aug}}:\, M(Y_{\mathrm{aug}}) = Y_{\mathrm{obs}}\}} p(Y_{\mathrm{aug}} \mid \theta)\, dY_{\mathrm{aug}},$
where M is some many-to-one mapping. EM iteratively computes Theta[sup (t+1)] = argmax[sub Theta is an element of Theta] Q(Theta|Theta[sup (t)]), where $Q(\theta \mid \theta^{(t)}) = E\left[\, l(\theta \mid Y_{\mathrm{aug}}) \mid Y_{\mathrm{obs}}, \theta^{(t)} \right]$, with l(Theta|Y[sub aug]) = log p(Y[sub aug]|Theta) + log p(Theta). (Here and henceforth integration over Y[sub aug] is over the set {Y[sub aug]: M(Y[sub aug]) = Y[sub obs]}.) Computing the expectation, Q(Theta|Theta[sup (t)]), is known as the E-step, while the maximization operation is known as the M-step. It can be shown that this procedure assures that l(Theta[sup (t+1)]|Y[sub obs]) >/= l(Theta[sup (t)]|Y[sub obs]) and typically converges to a (local) maximum of l(Theta|Y[sub obs]) (Dempster, Laird, and Rubin 1977; Wu 1983).
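The E- and M-steps above can be made concrete with a toy one-way random-effects model, y_i = b_i + e_i with the residual variance known, augmenting the data with the b_i. This is only an illustrative sketch (the model, seed, and starting value are our own choices, not the article's); it checks EM's defining property, monotone ascent of the objective:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 1.0                                   # residual variance, treated as known
y = rng.normal(0.0, np.sqrt(2.0), size=200)    # marginally y_i ~ N(0, tau2 + sigma2), true tau2 = 1

def loglik(tau2):
    v = tau2 + sigma2
    return -0.5 * np.sum(np.log(2 * np.pi * v) + y**2 / v)

tau2 = 5.0                                     # deliberately poor start
trace = [loglik(tau2)]
for _ in range(50):
    # E-step: posterior moments of each b_i given y_i and the current tau2
    shrink = tau2 / (tau2 + sigma2)
    Eb = shrink * y                            # E[b_i | y_i]
    Vb = shrink * sigma2                       # Var(b_i | y_i) = tau2*sigma2/(tau2+sigma2)
    # M-step: maximize Q, i.e. set tau2 to the average of E[b_i^2 | y_i]
    tau2 = np.mean(Eb**2 + Vb)
    trace.append(loglik(tau2))

# the defining property of EM: the objective never decreases
assert all(t1 >= t0 - 1e-8 for t0, t1 in zip(trace, trace[1:]))
```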
The choice of the data-augmentation scheme, p(Y[sub aug]|Theta), in (2.1) is not unique. In fact, this choice can greatly affect the rate of convergence of EM-type algorithms (Meng and van Dyk 1997; Liu, Rubin, and Wu 1998) and their stochastic counterparts such as the data augmentation algorithm (Tanner and Wong 1987; Meng and van Dyk 1999; Liu and Wu 1999; van Dyk and Meng in press). In order to choose p(Y[sub aug]|Theta) to result in efficient algorithms, the working parameter approach (e.g., Meng and van Dyk 1999) parameterizes the data-augmentation scheme so that
(2.2) $p(Y_{\mathrm{obs}} \mid \theta) = \int p(Y_{\mathrm{aug}} \mid \theta, \alpha)\, dY_{\mathrm{aug}},$
for each Alpha in some class, A. (Likewise, we sometimes index Q(Theta|Theta') by the working parameter for clarity--that is, Q[sub Alpha] (Theta|Theta').) The method of conditional augmentation suggests choosing Alpha to minimize (i.e., optimize) the global rate of convergence (e.g., Meng and Rubin 1994) of EM--that is, the largest eigenvalue of the matrix fraction of missing information,
(2.3) DM[sup EM](Alpha) = I - I[sub obs]I[sup -1, sub aug](Alpha),
where I[sub obs] is the observed Fisher information and
$I_{\mathrm{aug}}(\alpha) = E\left[\, -\,\partial^2 l(\theta \mid Y_{\mathrm{aug}}, \alpha) \big/ \partial\theta\, \partial\theta^T \,\middle|\, Y_{\mathrm{obs}}, \theta \right] \Big|_{\theta = \theta^*},$
is the expected augmented Fisher information. Note that we adopt the traditional terms (e.g., Fisher information) of the EM literature, which focuses on likelihood calculations, even though we also deal with more general posterior computations. If we choose Alpha to minimize I[sub aug](Alpha) in a positive semidefinite ordering sense we optimize the global rate of convergence. This idea has led to a number of very efficient algorithms for fitting multivariate t models, probit regression models, mixed-effects models, Poisson models for image reconstruction, and factor analysis models either directly (Fessler and Hero 1994, 1995; Meng and van Dyk 1997, 1998; van Dyk in press a) or indirectly through the PXEM algorithm (which is discussed in detail in Section 2.2; see also Liu, Rubin, and Wu 1998).
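In the scalar version of the toy model used earlier (tau2 the only parameter, sigma2 known), both informations are available in closed form, so the global rate in (2.3) can be checked against the empirically observed per-iteration error ratio. All constants below are our own illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2 = 1.0
m = 500
y = rng.normal(0.0, np.sqrt(3.0), size=m)     # true tau2 = 2, so v* is near 3

def em_step(tau2):                            # one EM update of tau2 (sigma2 known)
    shrink = tau2 / (tau2 + sigma2)
    return np.mean((shrink * y) ** 2 + shrink * sigma2)

tau2 = 1.0
for _ in range(5000):                         # run to (numerical) convergence
    tau2 = em_step(tau2)
tau2_star = tau2

# closed forms at the fixed point: I_obs = m/(2 v*^2) and I_aug = m/(2 tau*^4),
# so the scalar version of (2.3) gives rate = 1 - I_obs/I_aug = 1 - (tau*^2/v*)^2
v_star = tau2_star + sigma2
rate_theory = 1.0 - (tau2_star / v_star) ** 2

# empirical rate: ratio of successive errors once near the fixed point
t = tau2_star + 0.01
errs = []
for _ in range(8):
    t = em_step(t)
    errs.append(abs(t - tau2_star))
rate_emp = errs[-1] / errs[-2]

assert abs(rate_emp - rate_theory) < 0.01
```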
Liu and Rubin (1994) also realized that reducing I[sub aug](Alpha) is the key to speeding up EM. In their ECME algorithm, the augmented information is reduced to the observed information for some of the parameters. For example, a simple ECME algorithm dichotomizes the model parameter, Theta = (Theta[sub 1], Theta[sub 2]). The M-step of EM is then broken into two steps. In the first step we set Theta[sup (t+1/2)] to the maximizer of Q(Theta|Theta[sup (t)]) as a function of Theta, subject to the constraint Theta[sub 2] = Theta[sup (t), sub 2]. In the second step Theta[sup (t+1)] is set to the maximizer of l(Theta|Y[sub obs]) subject to the constraint Theta[sub 1] = Theta[sup (t+1/2), sub 1]. Since there is no data augmentation in the second step, we expect the algorithm to converge more quickly than EM, as was verified by Liu and Rubin (1994, 1995) in several examples.
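The two conditional maximization steps can be sketched on a toy heteroscedastic random-effects model, y_i = z_i b_i + e_i (an illustrative setup of our own, not the article's): tau2 is updated through Q with sigma2 fixed, and sigma2 is then updated by maximizing the observed-data log-likelihood directly, here by bisecting its derivative, with a guard so the objective never decreases:

```python
import numpy as np

rng = np.random.default_rng(2)
m = 300
z = rng.uniform(0.5, 2.0, size=m)                  # known scalar design values
y = z * rng.normal(size=m) + rng.normal(size=m)    # true tau2 = sigma2 = 1

def loglik(tau2, s2):
    v = z**2 * tau2 + s2                           # marginal variance of y_i
    return -0.5 * np.sum(np.log(2 * np.pi * v) + y**2 / v)

def score_s2(tau2, s2):                            # d loglik / d sigma2
    v = z**2 * tau2 + s2
    return 0.5 * np.sum(y**2 / v**2 - 1.0 / v)

tau2, s2 = 3.0, 0.2
trace = [loglik(tau2, s2)]
for _ in range(100):
    # CM-step 1 (uses data augmentation): maximize Q over tau2 with sigma2 fixed
    v = z**2 * tau2 + s2
    Eb = tau2 * z * y / v                          # E[b_i | y_i]
    Vb = tau2 * s2 / v                             # Var(b_i | y_i)
    tau2 = np.mean(Eb**2 + Vb)
    # CM-step 2 (no data augmentation): maximize the observed-data log-likelihood
    # over sigma2 with tau2 fixed, by bisecting the score
    lo, hi = 1e-8, 100.0
    if score_s2(tau2, lo) <= 0:
        cand = lo
    elif score_s2(tau2, hi) >= 0:
        cand = hi
    else:
        for _ in range(80):
            mid = 0.5 * (lo + hi)
            lo, hi = (mid, hi) if score_s2(tau2, mid) > 0 else (lo, mid)
        cand = 0.5 * (lo + hi)
    if loglik(tau2, cand) >= loglik(tau2, s2):     # guard: never decrease l
        s2 = cand
    trace.append(loglik(tau2, s2))

assert all(t1 >= t0 - 1e-8 for t0, t1 in zip(trace, trace[1:]))
```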
We use both ECME and conditional augmentation to build efficient algorithms for fitting mixed-effects models in Section 3. First, however, we turn our attention to an important extension of conditional augmentation, namely the PXEM algorithm.
2.2 PXEM FOR BAYESIAN CALCULATIONS
Liu, Rubin, and Wu (1998) presented the PXEM algorithm as a fast adaptation of conditional augmentation in the special case when p(Theta) proportional to 1; for example, in maximum likelihood estimation. Simply speaking, instead of conditioning on an optimal value of the working parameter, Alpha, PXEM fits Alpha in the iteration. Here we outline a generalization of this algorithm by using this same novel idea to compute posterior modes. We start by defining $Q_{\mathrm{px}}(\theta, \alpha \mid \theta', \alpha_0) = E\left[\, l(\theta, \alpha \mid Y_{\mathrm{aug}}) \mid Y_{\mathrm{obs}}, \theta', \alpha_0 \right]$. We then define (Theta[sup (t+1)], Alpha[sup (t+1)]) as the maximizer of Q[sub px](Theta, Alpha|Theta[sup (t)], Alpha[sub 0]), where Alpha[sub 0] is some fixed value. (Note Alpha[sup (t+1)] is not used subsequently.) Since p(Y[sub obs]|Theta) = p(Y[sub obs]|Theta, Alpha) = p(Y[sub aug]|Theta, Alpha)/p(Y[sub aug]|Y[sub obs], Theta, Alpha) for any Alpha is an element of A and any Y[sub aug] such that M(Y[sub aug]) = Y[sub obs], we have
$l(\theta \mid Y_{\mathrm{obs}}) = Q_{\mathrm{px}}(\theta, \alpha \mid \theta^{(t)}, \alpha_0) - E\left[\, \log p(Y_{\mathrm{aug}} \mid Y_{\mathrm{obs}}, \theta, \alpha) \,\middle|\, Y_{\mathrm{obs}}, \theta^{(t)}, \alpha_0 \right].$
Since the first term on the right is maximized by (Theta, Alpha) = (Theta[sup (t+1)], Alpha[sup (t+1)]), and the second is minimized by (Theta, Alpha) = (Theta[sup (t)], Alpha[sub 0]), we have l(Theta[sup (t+1)]|Y[sub obs]) >/= l(Theta[sup (t)]|Y[sub obs]). That is, this generalization of the PXEM algorithm converges monotonically in log posterior. Following Wu (1983) we can further obtain that this algorithm converges to a stationary point or local maximum of the posterior. These results hold for any value of Alpha[sub 0] such that all quantities exist. In fact, the value of Alpha[sub 0] is generally irrelevant for a PXEM iteration and is simply set to some convenient value (e.g., Alpha[sub 0] = 1 for scale working parameters and Alpha[sub 0] = 0 for location working parameters).
We expect this algorithm to perform at least as well as an algorithm that fixes Alpha (i.e., conditional data augmentation) in terms of the global rate of convergence because it essentially removes the conditioning on Alpha in the data-augmentation scheme. Removing this conditioning reduces I[sub aug] (in a positive semidefinite ordering sense) and thus improves the rate of convergence of EM (see Meng and van Dyk 1997 and Liu, Rubin, and Wu 1998 for details). Liu, Rubin, and Wu (1998) gave an alternative explanation--that by fitting a, we are performing a covariance adjustment to capitalize on information in the data-augmentation scheme. They also illustrated the substantial computational advantage PXEM can offer over other EM-type algorithms for ML estimation.
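The covariance-adjustment idea can be seen numerically in the toy model y_i = z_i b_i + e_i with sigma2 known: the PXEM update also fits a scale working parameter alpha and then maps back to tau2. This is an illustrative sketch under our own assumed model, not the article's implementation:

```python
import numpy as np

rng = np.random.default_rng(3)
m = 400
sigma2 = 1.0                                   # residual variance, known here
z = rng.uniform(0.5, 2.0, size=m)
y = z * rng.normal(size=m) + rng.normal(0.0, np.sqrt(sigma2), size=m)   # true tau2 = 1

def e_step(tau2):
    v = z**2 * tau2 + sigma2
    Eb = tau2 * z * y / v                      # E[b_i | y_i]
    Eb2 = Eb**2 + tau2 * sigma2 / v            # E[b_i^2 | y_i]
    return Eb, Eb2

def run(update, tau2=5.0, tol=1e-10, max_iter=100000):
    for it in range(1, max_iter + 1):
        new = update(tau2)
        if abs(new - tau2) < tol:
            return new, it
        tau2 = new
    return tau2, max_iter

def em_update(tau2):                           # standard EM: alpha fixed at 1
    _, Eb2 = e_step(tau2)
    return np.mean(Eb2)

def pxem_update(tau2):                         # PXEM: also fit the working parameter
    Eb, Eb2 = e_step(tau2)
    xi = np.mean(Eb2)                          # M-step for the expanded variance
    alpha = np.sum(z * y * Eb) / np.sum(z**2 * Eb2)   # M-step for alpha
    return alpha**2 * xi                       # reduction step: tau2 = alpha^2 * xi

tau_em, n_em = run(em_update)
tau_px, n_px = run(pxem_update)

assert abs(tau_em - tau_px) < 1e-6             # same fixed point
assert n_px <= n_em                            # PXEM needs no more iterations
```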
Unfortunately, this algorithm can be difficult to use for some Bayesian computations since the maximizer of Q[sub px](Theta, Alpha|Theta[sup (t)], Alpha[sub 0]) may not exist in closed form, even when the corresponding maximum likelihood PXEM algorithm is in closed form and a conjugate prior is used. A simple random-effects model illustrates both this difficulty and the potential computational advantage. Suppose
(2.4) $Y_i = \alpha Z_i b_i + e_i, \qquad b_i \sim N(0, \tau^2/\alpha^2), \qquad e_i \sim N(0, \sigma^2),$
for i = 1,..., m, where b[sub i] and e[sub i] are independent, {(Y[sub i], Z[sub i]),i = 1,..., m} are the observed data, Alpha is a working parameter, and all quantities are scalars. (It can easily be verified that p(Y[sub obs]|Theta, Alpha), with Theta = (Sigma[sup 2], Tau[sup 2]), does not depend on Alpha.) We set Y[sub aug] = {(Y[sub i], Z[sub i], b[sub i]), i = 1,..., m} and Alpha[sub 0] = 1 to define the data-augmentation scheme via (2.4). We consider the independent priors Sigma[sup 2] is similar to v Sigma[sup 2, sub 0]/Chi[sup 2, sub v] and Tau[sup 2] is similar to Eta Tau[sup 2, sub 0]/Chi[sup 2, sub Eta], which are conjugate for p(Y[sub aug]|Theta, Alpha[sub 0]), in which case
(2.5) [Multiple line equation(s) cannot be represented in ASCII text]
In the absence of prior information, the usual strategy is to reparameterize (e.g., set Xi = Tau[sup 2]/Alpha[sup 2]) in order to simplify the optimization of Q[sub px](Theta, Alpha|Theta', Alpha[sub 0]). Although this works nicely with maximum likelihood, it clearly does not work with (2.5). It can be shown that introducing a proper prior for Alpha destroys the computational advantage of parameter expansion, while introducing a dependent prior (e.g., Tau[sup 2]|Alpha is similar to Eta Tau[sup 2, sub 0]Alpha[sup 2]/Chi[sup 2, sub Eta]) alters the model and Theta[sup *]. Thus, it seems difficult to optimize (2.5) without resorting to iterative numerical methods, the cost of which is likely to outweigh the benefit of PXEM. (This is certainly true with multiple random effects, in which case the scalar Tau[sup 2] is replaced by a variance-covariance matrix.) Thus, we typically use conditional augmentation for Bayesian mode finding, at least when we use fully proper priors.
In cases where Q[sub px](Theta, Alpha|Theta', Alpha[sub 0]) is easy to maximize, however, the algorithm can be very fast. Suppose for example we use the improper prior p(Theta) proportional to (Sigma[sup 2])[sup -1](Tau[sup 2])[sup -1/2] (i.e., Tau[sup 2, sub 0] = 0, Eta = -1). Hobert and Casella (1996) verified that this prior results in a proper posterior as long as m, n >/= 2; see also the Appendix. In this case (2.5) is unbounded for Tau[sup 2] near zero. Thus, the PXEM algorithm converges in one step to the global mode Tau[sup 2] = 0. Figure 1, for example, shows a contour plot of the posterior surface for an artificial dataset of size 100. For this dataset, a standard EM algorithm which fixes Alpha = 1 converges with global rate equal to one (empirical result) and thus takes many iterations to converge. The second plot in Figure 1 shows a cross section of the posterior surface with Sigma[sup 2] = 1. The conditional posterior is bimodal, and a standard EM algorithm which computes (Tau[sup 2])[sup *] given Sigma[sup 2] = 1 with Alpha fixed at one converges to a (local) mode for (Tau[sup 2])[sup (0)] >/= .026, the local minimum (again this is an empirical result). The PXEM algorithm, on the other hand, again converges to the global mode in one step for any (Tau[sup 2])[sup (0)].
We emphasize that Figure 1 is included to compare the behavior of PXEM and the standard EM algorithm. It is not necessarily better to converge to the global mode. In particular, the local (conditional) mode contains more of the posterior probability and is driven by the data rather than the prior. Our conclusion is that this improper prior should be used only with great care--the mode is not a sufficient summary of the posterior, at least on the original scale. It is noteworthy that using PXEM brings attention to this difficulty with the posterior.
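The unboundedness of the posterior at Tau[sup 2] = 0 under the prior (Tau[sup 2])[sup -1/2] is easy to reproduce numerically. The sketch below, with sigma2 fixed at 1 and artificial data (all our own choices), shows the spike at the origin dominating an interior local mode:

```python
import numpy as np

rng = np.random.default_rng(4)
sigma2 = 1.0
y = rng.normal(0.0, np.sqrt(2.0), size=100)    # y_i = b_i + e_i with true tau2 = sigma2 = 1

def log_post(tau2):
    v = tau2 + sigma2
    loglik = -0.5 * np.sum(np.log(v) + y**2 / v)
    return loglik - 0.5 * np.log(tau2)         # prior p(tau2) proportional to (tau2)^(-1/2)

grid = np.linspace(0.05, 5.0, 200)
interior = grid[np.argmax([log_post(t) for t in grid])]   # interior (local) mode

# the prior spike wins as tau2 -> 0: the posterior is unbounded at the origin
assert log_post(1e-30) > log_post(interior)
assert log_post(1e-30) > log_post(1e-12)       # and it keeps growing as tau2 shrinks
```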
3.1 EFFICIENT ALGORITHMS WITH PROPER PRIORS
We begin in the fully Bayesian setting because the model notation is the most general and the algorithms that rely on conditional augmentation are simpler and serve as building blocks for the algorithms for computing REML estimates, posterior modes with improper priors (both in Section 3.2), and ML estimates (Section 3.3).
We consider the mixed-effects model of the general form
(3.1) $Y_i = X_i\beta + Z_i b_i + e_i, \qquad b_i \sim N_q(0, \sigma^2 D), \qquad e_i \sim N_{n_i}(0, \sigma^2 R_i),$
where Y[sub i] is n[sub i] x 1 for i = 1,..., m, X[sub i] and Z[sub i] are known covariates of dimension n[sub i] x p and n[sub i] x q, respectively, Beta is a p x 1 vector of fixed effects, b[sub i] = (b[sub i1],..., b[sub iq])[sup T] are q x 1 vectors of random effects, R[sub i] are known n[sub i] x n[sub i] positive definite matrices, and b[sub i] and e[sub i] are independent for each i. We parameterize the variance of the random effects in terms of the residual variance Sigma[sup 2] to facilitate computation but have occasion to use both the parameterizations T = Sigma[sup 2]D and LL[sup T] = D, where L is the Cholesky factor of D (i.e., L is a lower triangular matrix). We introduce the standard priors which are conjugate for the data-augmentation schemes defined below
(3.2) Beta|Sigma[sup 2] is similar to N[sub p](Mu[sub Beta], Sigma[sup 2]Sigma[sub Beta]), Sigma[sup 2] is similar to v Sigma[sup 2, sub 0]/Chi[sup 2, sub v], and T is similar to inverse Wishart(Eta, T[sub 0]),
where the inverse Wishart is parameterized so that E(T) = (Eta - q -1)[sup -1] T[sub 0], with Eta the degrees of freedom. On occasion we replace the inverse Wishart prior for T with
(3.3) vec[sub T](L) is similar to N[sub q][sub (q+1)/2](Mu[sub L], Sigma[sup 2]Sigma[sub L]),
where vec[sub T](M) is a vector containing the elements on or below the diagonal of M; the subscript T stands for triangular. Depending on the data-augmentation scheme one or the other of these priors is conjugate. Although the priors do not exactly coincide, either can be used to incorporate similar prior information. With the second prior, this is facilitated by noting that the diagonal elements of Sigma L are prior conditional standard deviations of the components of b[sub i] (i.e., sd(b[sub ij]|b[sub i1],..., b[sub i,j-1]) is the jth diagonal element of Sigma L), while the jkth element of Sigma L (j > k) is the square root of the variance in b[sub ij] not explained by (b[sub i1],..., b[sub i,k-1]) that is explained by b[sub ik].
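The interpretation of the diagonal of L can be verified with the Schur-complement formula for conditional variances of a multivariate normal; a small check with sigma2 = 1 for simplicity and arbitrary (hypothetical) entries in L:

```python
import numpy as np

# lower-triangular Cholesky factor L, so b = L u with u ~ N(0, I) and Var(b) = D = L L^T
# (sigma2 = 1 here for simplicity; the numbers in L are made up for illustration)
L = np.array([[2.0,  0.0, 0.0],
              [1.0,  1.5, 0.0],
              [0.5, -0.7, 0.9]])
D = L @ L.T

for j in range(3):
    if j == 0:
        cond_var = D[0, 0]
    else:
        S11, s12 = D[:j, :j], D[:j, j]
        cond_var = D[j, j] - s12 @ np.linalg.solve(S11, s12)   # Schur complement
    # sd(b_j | b_1, ..., b_{j-1}) equals the jth diagonal element of L
    assert np.isclose(np.sqrt(cond_var), L[j, j])
```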
We use an EM-type algorithm to compute the mode of either p(T, Sigma[sup 2]|Y[sub obs]) or p(L, Sigma[sup 2]|Y[sub obs]), where Y[sub obs] = {(Y[sub i], X[sub i], Z[sub i]), i = 1,..., m}. We are interested in a marginal mode, since working in smaller parameter spaces typically leads to modes with better statistical properties (e.g., Gelman, Carlin, Stern, and Rubin 1995, sec. 9.5). We then use p(b[sub 1],..., b[sub m], Beta|Zeta, Y[sub obs]) with Zeta = (Sigma[sup 2], D) as derived below to draw inferences regarding the mixed effects.
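The implied marginal moments of (3.1) are worth confirming before any fitting: integrating out b[sub i] gives Var(Y[sub i]) = Sigma[sup 2](R[sub i] + Z[sub i]DZ[sub i, sup T]). A Monte Carlo sketch with made-up dimensions and parameter values (all our own assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
n_i, p, q = 3, 2, 2
X = rng.normal(size=(n_i, p))
Z = rng.normal(size=(n_i, q))
beta = np.array([1.0, -0.5])
sigma2 = 2.0
L = np.array([[1.0, 0.0], [0.5, 0.8]])        # Cholesky factor, so D = L @ L.T
D = L @ L.T
R = np.eye(n_i)

N = 200_000                                   # Monte Carlo replications of group i
b = rng.normal(size=(N, q)) @ (np.sqrt(sigma2) * L).T   # b ~ N(0, sigma2 * D)
e = rng.normal(size=(N, n_i)) @ (np.sqrt(sigma2) * np.linalg.cholesky(R)).T
Y = X @ beta + b @ Z.T + e                    # draws from model (3.1)

emp_cov = np.cov(Y, rowvar=False)
theory = sigma2 * (R + Z @ D @ Z.T)           # marginal covariance after integrating out b
assert np.allclose(emp_cov, theory, atol=0.05 * np.max(np.abs(theory)))
assert np.allclose(Y.mean(axis=0), X @ beta, atol=0.05)
```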
Using the working parameter, Alpha, we define Y[sub aug] = {(Y[sub i], X[sub i], Z[sub i], L[sup -Alpha]b[sub i]), i = 1,..., m; Beta} and derive two ECME algorithms to compute the marginal mode. Both of the algorithms dichotomize the parameter Zeta = (D, Sigma[sup 2]) into D and Sigma[sup 2], first updating D by maximizing either Q[sub Alpha = 0](T, Sigma[sup 2]| Zeta') or Q[sub Alpha=1] (L, Sigma[sup 2]|Zeta') subject to the constraint that Sigma[sup 2] is fixed and second updating Sigma[sup 2] by maximizing either l(T, Sigma[sup 2]|Y[sub obs]) or l(L, Sigma[sup 2]|Y[sub obs]) subject to the constraint that D is fixed. That is, the algorithms do not use data augmentation when updating Sigma[sup 2] but do require data augmentation to update D. Because of the choice of prior, the first algorithm computes the posterior mode of p(T, Sigma[sup 2]|Y[sub obs]), while the second computes the posterior mode of p(L, Sigma[sup 2]|Y[sub obs]), hence the notation Q[sub Alpha=0](T, Sigma[sup 2]| Zeta') and Q[sub Alpha=1](L, Sigma[sup 2]|Zeta'). Since the algorithms differ in the value of the working parameter Alpha, we call them ECME[sub 0] which corresponds to Alpha = 0, and ECME[sub 1] which corresponds to Alpha = 1. As is illustrated in Section 4, the relative efficiency of ECME[sub 1] and ECME[sub 0] depends roughly on the relative size of (Sigma[sup 2])[sup *] and Sigma Z[sub i, sup T] T[sup *] Z[sub i] (see Meng and van Dyk 1998 for details).
We begin with ECME[sub 0] which updates D by optimizing
(3.4) [Multiple line equation(s) cannot be represented in ASCII text].
To compute the expectations, we use standard calculations to compute the following conditional distributions:
(3.5) $\beta \mid \zeta, Y_{\mathrm{obs}} \sim N\!\left(\beta(D),\; \sigma^2\Big(\Sigma_\beta^{-1} + \sum_{i=1}^{m} X_i^T U_i(D) X_i\Big)^{-1}\right),$
where Beta(D) = (Sigma[sup -1, sub Beta] + Sigma[sup m, sub i=1] X[sup T, sub i] U[sub i](D) X[sub i])[sup -1] (Sigma[sup -1, sub Beta] Mu[sub Beta] + Sigma[sup m, sub i=1] X[sup T, sub i] U[sub i](D)Y[sub i]) and U[sub i](D) = (R[sub i] + Z[sub i] DZ[sub i, sup T])[sup -1];
(3.6) b[sub i]|Zeta, Beta, Y[sub obs] is similar to N(b[sub i](D, Beta), Sigma[sup 2](D - DZ[sub i, sup T] U[sub i](D)Z[sub i] D)),
where b[sub i](D, Beta) = DZ[sub i, sup T] U[sub i](D)(Y[sub i] - X[sub i] Beta); and
(3.7) b[sub i]|Zeta, Y[sub obs] is similar to N(b[sub i](D, Beta(D)), Sigma[sup 2](D - DZ[sub i, sup T] P[sub i](D)Z[sub i]D)),
where P[sub i](D) = U[sub i](D) - U[sub i](D)X[sub i] (Sigma[sup -1, sub Beta] + Sigma[sup m, sub i=1] X[sub i, sup T] U[sub i](D)X[sub i])[sup -1] X[sub i, sup T] U[sub i](D). Using (3.7) to compute the second expectation in (3.4), it is easy to verify that Q(T, Sigma[sup 2]|Zeta[sup (t)]) is maximized as a function of D by
(3.8) $D^{(t+1)} = \frac{T_0 + \sum_{i=1}^{m} B_i(\zeta^{(t)})}{(\sigma^2)^{(t)}\,(m + \eta + q + 1)},$
where B[sub i](Zeta) = E[b[sub i]b[sup T, sub i]|Y[sub obs], Zeta] = b[sub i](D, Beta(D))b[sub i, sup T](D, Beta(D)) + Sigma[sup 2](D - DZ[sub i, sup T] P[sub i](D)Z[sub i]D). To update Sigma[sup 2], we write,
[Multiple line equation(s) cannot be represented in ASCII text],
which is maximized as a function of Sigma[sup 2] with D fixed at D[sup (t+1)] by
(3.9) [Multiple line equation(s) cannot be represented in ASCII text].
This completes a single iteration of ECME[sub 0]. We note the relationship between this ECME algorithm and the one given by Liu and Rubin (1994) for maximum likelihood estimation. Although they used the same data-augmentation scheme (i.e., Alpha = 0), Liu and Rubin's update for Sigma[sup 2] is not in closed form because they use the parameterization (Beta, Sigma[sup 2], T) in place of (Beta, Sigma[sup 2], D) in the constraint functions. The required numerical optimization slows down the algorithm substantially (see van Dyk and Meng 1997). The Liu and Rubin algorithm also does not include prior information and does not integrate out the fixed effects. Although their algorithm could be adapted to the Bayesian setting, it is unlikely to be fruitful since the numerical optimization in the update for Sigma[sup 2] is slow. The utility of the parameterization (Beta, Sigma[sup 2], D) was also noted by Schafer (1998) (see also Lindstrom and Bates 1988).
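The conditional moments (3.5)-(3.7) do all the work in the E-step above. As a numerical cross-check, the moments of b[sub i] given Beta, as in (3.6), can be recomputed through the precision (least-squares) route; agreement between the two routes is a Woodbury-identity check. All inputs below are arbitrary illustrative values:

```python
import numpy as np

rng = np.random.default_rng(6)
n_i, q, p = 4, 2, 2
Z = rng.normal(size=(n_i, q))
X = rng.normal(size=(n_i, p))
beta = np.array([0.3, -1.2])
A = rng.normal(size=(q, q)); D = A @ A.T + 0.5 * np.eye(q)   # positive definite D
B = rng.normal(size=(n_i, n_i)); R = B @ B.T + np.eye(n_i)   # positive definite R
sigma2 = 1.7
Y = rng.normal(size=n_i)                                     # an arbitrary observed vector

U = np.linalg.inv(R + Z @ D @ Z.T)                           # U_i(D) in the text
mean_marg = D @ Z.T @ U @ (Y - X @ beta)                     # posterior mean as in (3.6)
var_marg = sigma2 * (D - D @ Z.T @ U @ Z @ D)                # posterior variance as in (3.6)

# the same moments via the posterior precision; equality of the two routes
# is exactly the Woodbury matrix identity
prec = np.linalg.inv(D) + Z.T @ np.linalg.solve(R, Z)
mean_prec = np.linalg.solve(prec, Z.T @ np.linalg.solve(R, Y - X @ beta))
var_prec = sigma2 * np.linalg.inv(prec)

assert np.allclose(mean_marg, mean_prec)
assert np.allclose(var_marg, var_prec)
```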
We now turn our attention to Alpha = 1 in the conditional-augmentation scheme and use the alternative prior given in (3.3) in order to maintain a closed-form algorithm. We rescale the random effects by L[sup -1] and consider {c[sub i] equivalent to L[sup -1]b[sub i], i = 1, ..., m} to be the missing data. In order to update D, we rewrite (3.1) in terms of (c[sub 1],..., c[sub m]),
(3.10) $Y_i = X_i\beta + \sum_{j=1}^{q}\sum_{k=j}^{q} l_{kj}\, c_{ij} Z_{ik} + e_i,$
where c[sub i] is similar to N(0, Sigma[sup 2] I), e[sub i] is similar to N(0, Sigma[sup 2] R[sub i]), L = (l[sub kj]), and Z[sub i] = (Z[sub i1], ..., Z[sub iq]) with Z[sub ik] an n[sub i] x 1 vector to obtain
[Multiple line equation(s) cannot be represented in ASCII text]
as a function of L. Thus,
(3.11) $\mathrm{vec}_T(L^{(t+1)}) = \Big(\Sigma_L^{-1} + \sum_{i=1}^{m} C_i(\zeta^{(t)})\Big)^{-1}\Big(\Sigma_L^{-1}\mu_L + \sum_{i=1}^{m} E\big[X_i^T R_i^{-1}(Y_i - X_i\beta) \,\big|\, Y_{\mathrm{obs}}, \zeta^{(t)}\big]\Big),$
where X[sub i] is an n[sub i] x (q(q + 1)/2) matrix with columns c[sub ij]Z[sub ik] for j = 1,..., q and k = j,..., q in the ordering that corresponds to vec[sub T](L) and C[sub i](Zeta) = E[X[sub i, sup T] R[sup -1, sub i] X[sub i] | Y[sub obs], Zeta], the elements of which are calculated using
(3.12) E[c[sub ij] Z[sup T, sub ik]R[sup -1, sub i]Z[sub ik]'c[sub ij]'|Y[sub obs], Zeta] = [L[sup -1] B[sub i](Zeta)(L[sup -1])[sup T]][sub jj]'Z[sup T, sub ik] R[sup -1, sub i] Z[sub ik]',
where [M][sub jj'] is the (j, j')th element of the matrix M. To compute the expectation in (3.11) we note
(3.13) E[c[sub ij] Z[sup T, sub ik] R[sup -1, sub i] (Y[sub i] - X[sub i]Beta)|Y[sub obs], Zeta] = [L[sup -1] b[sub i](D, Beta(D))][sub j] Z[sup T, sub ik] R[sup -1, sub i] Y[sub i] - Z[sup T, sub ik] R[sup -1, sub i] X[sub i] [H[sub i](Zeta)][sup T, sub j],
where [v][sub j] is the jth component of the vector v and [H[sub i](Zeta)][sub j] is the jth row of E[c[sub i]Beta[sup T]|Y[sub obs], Zeta],
(3.14) [Multiple line equation(s) cannot be represented in ASCII text].
The iteration is completed by setting D[sup (t+1)] = L[sup (t+1)](L[sup (t+1)])[sup T] and computing
(3.15) [Multiple line equation(s) cannot be represented in ASCII text],
which adjusts (3.9) for the prior. The matrix inversions here and in the various algorithms can be facilitated with the SWEEP operator (Beaton 1964; Little and Rubin 1987; Meng and van Dyk 1998).
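The SWEEP operator referenced here can be sketched as follows (this is one common sign convention, under which sweeping every pivot of a symmetric positive-definite matrix returns its negative inverse):

```python
import numpy as np

def sweep(A, k):
    """One application of the SWEEP operator on pivot k of a symmetric matrix."""
    A = np.array(A, dtype=float)
    d = A[k, k]
    row = A[k, :].copy()
    col = A[:, k].copy()
    A -= np.outer(col, row) / d       # a_ij <- a_ij - a_ik * a_kj / d
    A[k, :] = row / d                 # a_kj <- a_kj / d
    A[:, k] = col / d                 # a_ik <- a_ik / d
    A[k, k] = -1.0 / d                # pivot <- -1/d
    return A

rng = np.random.default_rng(8)
G = rng.normal(size=(4, 4))
A = G @ G.T + 4 * np.eye(4)           # symmetric positive definite

S = A
for k in range(4):
    S = sweep(S, k)

# sweeping every pivot of an SPD matrix yields the negative inverse
assert np.allclose(S, -np.linalg.inv(A))
```

Sweeping a subset of pivots solves the corresponding block least-squares system in place, which is why it pairs naturally with the matrix inversions in these algorithms.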
We conclude this section with one final data-augmentation scheme. Above we considered the reparameterization LL[sup T] = D, where L is a lower triangular matrix. If instead we allow L to be an arbitrary invertible matrix, we can derive an even more general class of algorithms (see also Foulley and van Dyk in press). Unfortunately, since there is no ready interpretation of the elements of L in this case, establishing a meaningful prior distribution is difficult. Nonetheless, we derive the resulting algorithm, since it is a simple modification of ECME[sub 1] and serves as a useful building block for algorithms with improper priors.
This algorithm, ECME[sub 2], can be expressed easily if in (3.11)-(3.14) we consider L to be a q x q invertible matrix with prior vec(L) is similar to N[sub q[sup 2]](Mu[sub L], Sigma[sub L]), where vec(M) is a vector containing all the elements of M. We also substitute X[sub i] with X[sub i], an n[sub i] x q[sup 2] matrix with columns c[sub ij]Z[sub ik] for j = 1,..., q and k = 1,..., q in the ordering corresponding to vec(L). With these changes in notation, vec(L[sup (t+1)]) is given in (3.11)-(3.14) and (Sigma[sup 2])[sup (t+1)] in (3.15) with the denominator replaced by n + Nu + 2 + q[sup 2] to account for the change in prior.
3.2 REML CALCULATION AND IMPROPER PRIORS
Here we again consider model (3.1) but replace the prior for T with p(T) proportional to 1. Although this corresponds to setting T[sub 0] = 0 and Eta = -(q + 1) in (3.2), and therefore can be fit with ECME[sub 0], in this important special case we can use the data-augmentation scheme used for ECME[sub 2] to derive a parameter-expanded ECME[sub 2] algorithm that is more efficient than ECME[sub 0], ECME[sub 1], or ECME[sub 2]. (An ECME[sub 1] or ECME[sub 2] algorithm can be used if we consider the prior p(L) proportional to 1.) The algorithms derived here assume the improper prior
p(Beta, Sigma[sup 2], T) proportional to (Sigma[sup 2])[sup -(1+(p+v)/2)] exp{-1/(2Sigma[sup 2]) [v Sigma[sup 2, sub 0] + (Beta - Mu[sub Beta])[sup T] Sigma[sup -1, sub Beta] (Beta - Mu[sub Beta])]}.
If in addition, we set Sigma[sup 2, sub 0] = 0, v = - (2 + p) and Sigma[sup -1, sub Beta] = 0, the posterior mode of (Sigma[sup 2], D) corresponds to the REML estimate (Laird and Ware 1982).
To derive the parameter-expanded ECME[sub 2] algorithm, we define Y[sub aug] = {(Y[sub i], X[sub i], Z[sub i], L[sup -1]b[sub i]), i = 1,...,m}; here and in the remainder of the article, L is an arbitrary invertible q x q working parameter. We emphasize that L is not a transformation of T, but a free working parameter. We set L[sub 0] = I at each iteration, thus the E-step is the same for ECME[sub 0] and the parameter-expanded ECME[sub 2] algorithm. We update the parameters using the same conditioning scheme as in the fully Bayesian setting. That is, we update (T, L) by maximizing Q[sub px](T, Sigma[sup 2], L|Zeta[sup (t)], L[sub 0]) with Sigma[sup 2] fixed and update Sigma[sup 2] by maximizing l(T, Sigma[sup 2]|Y[sub obs]) with D fixed. In particular, if we replace b[sub i] with Lc[sub i], Q[sub px](T, Sigma[sup 2], L|Zeta[sup (t)], L[sub 0]) is given by (3.4), which we maximize jointly as a function of D and L to update these parameters. This is accomplished via the transformation from (D, L) to (D-tilde, L), where D-tilde = L[sup -1]D(L[sup -1])[sup T]. In particular, we maximize Q[sub px](T, Sigma[sup 2], L|Zeta[sup (t)], L[sub 0]) by computing L[sup (t+1)] using (3.11) and setting D-tilde[sup (t+1)] to the right side of (3.8). In these calculations we set Sigma[sup -1, sub L] = 0, T[sub 0] = 0, and Eta = -(q + 1) throughout, and set L = L[sub 0] = I and X[sub i] to X[sub i] in (3.11)-(3.14). Finally, we update D with D[sup (t+1)] = L[sup (t+1)]D-tilde[sup (t+1)][L[sup (t+1)]][sup T] and complete the iteration by updating Sigma[sup 2] with (3.9).
3.3 MAXIMUM LIKELIHOOD CALCULATIONS
Computing the maximum likelihood estimate is similar to computing the posterior mode with an improper flat prior (as described in Section 3.2), except we regard Beta as a model parameter and seek Theta[sup *] that maximizes l(Theta|Y[sub obs]), with Theta = (Sigma[sup 2], T, Beta), rather than the mode of the marginal posterior (e.g., l(Sigma[sup 2], T|Y[sub obs])). This simplifies calculations somewhat since Beta is regarded as a constant rather than a random variable in the E-step. In particular, we update L and D-tilde to compute D[sup (t+1)] and then compute Beta[sup (t+1)] and (Sigma[sup 2])[sup (t+1)] with two separate conditional maximizations. (In all formulas we replace X[sub i] with X[sub i] and set Sigma[sup -1, sub L] = Sigma[sup -1, sub Beta] = T[sub 0] = Sigma[sup 2, sub 0] = 0, Nu = -(p + 2), and Eta = -(q + 1), and in (3.12) we fix L = L[sub 0] = I.) We begin by computing L[sup (t+1)] using (3.11) with two simplifications. First, P(D) is replaced with U(D) in the definition of B[sub i](Zeta) used to compute C[sub i](Zeta[sup (t)]) (this change reflects the difference between (3.6) and (3.7)). Second, the expectation is computed conditional on Theta[sup (t)] = (Zeta[sup (t)], Beta[sup (t)]) using
E[c[sub ij]Z[sup T, sub ik] R[sup -1, sub i] (Y[sub i] - X[sub i] Beta)|Y[sub obs], Theta] = [b[sub i](D, Beta)][sub j] Z[sup T, sub ik] R[sup -1, sub i] (Y[sub i] - X[sub i] Beta).
We then compute D-tilde[sup (t+1)] using (3.8), again substituting U(D) for P(D) in the definition of B[sub i](Zeta), and set D[sup (t+1)] = L[sup (t+1)]D-tilde[sup (t+1)](L[sup (t+1)])[sup T]. Next we maximize the observed data likelihood as a function of Beta with D fixed at D[sup (t+1)] and Sigma[sup 2] fixed at (Sigma[sup 2])[sup (t)], with Beta[sup (t+1)] = Beta(D[sup (t+1)]). Likewise, in a final conditional maximization, we update Sigma[sup 2] with
[Multiple line equation(s) cannot be represented in ASCII text].
This completes the parameter-expanded ECME[sub 2] iteration.
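The reduction-and-recovery step at the heart of the working-parameter scheme can be made concrete. The following is an illustrative numpy sketch (not the article's code, and with arbitrary stand-in matrices): reduce D to D~ = L[sup -1]D(L[sup -1])[sup T] for an arbitrary invertible working parameter L, then recover D via D = L D~ L[sup T].

```python
import numpy as np

rng = np.random.default_rng(0)
q = 3

# Hypothetical current value of the random-effects variance D
# (any symmetric positive-definite q x q matrix).
A = rng.standard_normal((q, q))
D = A @ A.T + q * np.eye(q)

# An arbitrary invertible q x q working parameter L.
L = rng.standard_normal((q, q)) + 2 * np.eye(q)

# Reduction: D_tilde = L^{-1} D (L^{-1})^T ...
L_inv = np.linalg.inv(L)
D_tilde = L_inv @ D @ L_inv.T

# ... and the reverse map used to complete the iteration:
# D = L D_tilde L^T.
D_back = L @ D_tilde @ L.T
assert np.allclose(D_back, D)
```

The round trip is exact, which is what allows the joint maximization over (D~, L) to be translated back into an update for D.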
4. SIMULATION RESULTS
In this section we illustrate the relative computational efficiency of ECME[sub 0], ECME[sub 1], ECME[sub 2], and the parameter-expanded ECME[sub 1] and ECME[sub 2] algorithms. (Parameter-expanded ECME[sub 1] is analogous to parameter-expanded ECME[sub 2], but with a lower-triangular working parameter.) We computed REML estimates for a number of datasets generated from the model
(4.1) Y[sub i] = X Beta + Z[sub i]b[sub i] + e[sub i], for i = 1, ..., 30,
where Y[sub i] is 3 x 1 for each i, X = (1, 1, 1)[sup T], Beta = 1, Z[sub i] is a 3 x 3 matrix with elements generated as independent standard normals, b[sub i] ~ N(0, diag(1, 4, 9)), and e[sub i] ~ N(0, Sigma[sup 2]I), with b[sub i] and e[sub i] independent.
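The data-generating model (4.1) is easy to simulate directly. This numpy sketch uses a hypothetical seed and one of the Sigma[sup 2] values from the study; it is only meant to make the setup concrete.

```python
import numpy as np

rng = np.random.default_rng(1)       # hypothetical seed
m, n, q = 30, 3, 3
beta = np.array([1.0])               # Beta = 1
sigma2 = 4.0                         # one of the simulation values
X = np.ones((n, 1))                  # X = (1, 1, 1)^T
T = np.diag([1.0, 4.0, 9.0])         # Var(b_i) = diag(1, 4, 9)

Y, Z = [], []
for i in range(m):
    Zi = rng.standard_normal((n, q))              # iid standard normal entries
    bi = rng.multivariate_normal(np.zeros(q), T)  # random effects
    ei = rng.normal(0.0, np.sqrt(sigma2), n)      # residual errors
    Y.append(X @ beta + Zi @ bi + ei)             # model (4.1)
    Z.append(Zi)
```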
As will be seen in the simulation results (see also Meng and van Dyk 1998 for theoretical arguments), the relative efficiency of ECME[sub 0] and ECME[sub 1] depends on the relative sizes of the fitted values of the variance of Z[sub i]b[sub i] and the residual variance Sigma[sup 2], which we quantify via a measure of the overall coefficient of determination,
Delta[sup *] = [Sigma[sup m, sub i=1] tr(Z[sub i] T[sup *] Z[sup T, sub i])/m] / [(Sigma[sup 2])[sup *] + Sigma[sup m, sub i=1] tr(Z[sub i] T[sup *] Z[sup T, sub i])/m].
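As a minimal sketch of this measure (assuming fitted values T[sup *] and (Sigma[sup 2])[sup *] are available; the Z[sub i] below are arbitrary stand-ins):

```python
import numpy as np

def delta_star(Z_list, T_star, sigma2_star):
    """Overall coefficient of determination: the average trace of
    Z_i T* Z_i^T relative to itself plus the residual variance."""
    m = len(Z_list)
    avg_tr = sum(np.trace(Zi @ T_star @ Zi.T) for Zi in Z_list) / m
    return avg_tr / (sigma2_star + avg_tr)

rng = np.random.default_rng(2)
Z_list = [rng.standard_normal((3, 3)) for _ in range(30)]
T_star = np.diag([1.0, 4.0, 9.0])
print(delta_star(Z_list, T_star, sigma2_star=4.0))
```

By construction Delta[sup *] lies in (0, 1) and decreases as the residual variance grows, which is exactly how the simulation varies it below.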
In order to vary Delta[sup *] in the simulations, 50 datasets were generated with each of several values of Sigma[sup 2] (.25, 1, 4, 9, 16, 25, 36, 49, 64, and 81). For each of these 500 datasets, REML estimates were computed using each of five algorithms: ECME[sub i], i = 0, 1, 2, and parameter-expanded ECME[sub i], i = 1, 2, using flat priors. The starting value (Sigma[sup 2])[sup (0)] was obtained by fitting (4.1) ignoring the random effects, and T[sup (0)] was set to the identity matrix. Each algorithm was run until l((Sigma[sup 2])[sup (t)], T[sup (t)]|Y[sub obs]) - l((Sigma[sup 2])[sup (t-1)], T[sup (t-1)]|Y[sub obs]) < 10[sup -7].
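The stopping rule can be sketched generically. In the sketch below, em_step and loglik are hypothetical stand-ins for one ECME iteration and the observed-data log-likelihood; the toy contraction only illustrates the monotone-ascent property that makes the rule safe for EM-type algorithms.

```python
def run_until_converged(theta0, em_step, loglik, tol=1e-7, max_iter=10_000):
    """Iterate em_step until the log-likelihood increase drops below tol."""
    theta, ll = theta0, loglik(theta0)
    for t in range(1, max_iter + 1):
        theta = em_step(theta)
        ll_new = loglik(theta)
        if ll_new - ll < tol:   # monotone ascent makes this a safe test
            return theta, t
        ll = ll_new
    return theta, max_iter

# Toy illustration: a contraction toward 2 whose "log-likelihood"
# -(theta - 2)^2 increases monotonically, mimicking an EM-type update.
theta_hat, n_iter = run_until_converged(
    0.0, lambda th: (th + 2.0) / 2.0, lambda th: -(th - 2.0) ** 2)
```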
The results of the simulation appear in Figures 2 and 3. Figure 2 contains five plots that record the time required by each of the five algorithms (Tau[sub i] for ECME[sub i], i = 0, 1, 2, and Tau[sub px-i] for parameter-expanded ECME[sub i], i = 1, 2, in seconds on the log[sub 10] scale). All computations were run on a Sun UltraSparc computer. Judging from the first plot in Figure 2, ECME[sub 0] performs well when Delta[sup *] is large, but often required more than 20 seconds to converge when Delta[sup *] was relatively small. Conversely, ECME[sub 1] and ECME[sub 2] perform well when Delta[sup *] is small, but are occasionally very slow when Delta[sup *] is large. Finally, the parameter-expanded ECME algorithms converge quickly for all values of Delta[sup *] in the simulation. Moreover, increasing the dimension of the working parameter results in further advantage; compare parameter-expanded ECME[sub 1] with parameter-expanded ECME[sub 2]. Parameter-expanded ECME[sub 2] performs very well overall, requiring more than four seconds only twice and less than one second in 84% of the replications.
Figure 3 compares ECME[sub 0], ECME[sub 2], and parameter-expanded ECME[sub 2] by recording the log[sub 10] relative time required by each pair of algorithms (ECME[sub 2] versus ECME[sub 0], parameter-expanded ECME[sub 2] versus ECME[sub 0], and parameter-expanded ECME[sub 2] versus ECME[sub 2]). Theory in Meng and van Dyk (1998) suggested that ECME[sub 0] is faster than ECME[sub 1] only when Delta[sup *] > 2/3, or logit[sub 10](Delta[sup *]) > .30 (approximately); the simulation verifies the same relationship between ECME[sub 0] and ECME[sub 2]. It is also clear that ECME[sub 2] can be much faster than ECME[sub 0], while ECME[sub 0] shows only small relative gains over ECME[sub 2]. Thus, we recommend using ECME[sub 2] over ECME[sub 0] (e.g., for fully Bayesian analysis) unless it is known a priori that Delta[sup *] is large. [Meng and van Dyk (1998) discussed an adaptive strategy that approximates Delta[sup *] after several initial iterations in order to choose between Alpha = 0 and Alpha = 1; see also Section 5.] The final two plots show that parameter-expanded ECME[sub 2] tends to be about as fast as the faster of ECME[sub 0] and ECME[sub 2]. Thus, we recommend using parameter-expanded ECME[sub 2] whenever a flat prior is used for T (e.g., REML and ML calculations).
The horizontal line in each plot of Figure 2 corresponds to four seconds, the approximate time required to fit this model using the lme routine in S-Plus on the same computer. The computation time required by parameter-expanded ECME[sub 2] is generally less than four seconds, but with the additional important advantage of computational stability (e.g., monotone convergence in log-likelihood; see Section 5). Meng and van Dyk (1998) compared EM-type algorithms similar to ECME[sub 0] and ECME[sub 1] with the lme routine in S-Plus and the xtreg routine in STATA. Both comparisons were favorable for the EM-type algorithms in terms of both computational time and stability. In their investigations lme did not always converge to a mode of the log-likelihood. As a further comparison, we randomly selected one dataset generated with each of the ten values of Sigma[sup 2] and fit the mixed-effects model with lme, parameter-expanded ECME[sub 2], and the proc mixed routine in SAS. Table 1 displays the value of the restricted log-likelihood at the point of convergence for each algorithm applied to each of the ten datasets. Unfortunately, in six of the ten replications, proc mixed did not converge at all or converged to a value of T[sup *] that was not positive semidefinite. The lme routine again did not always converge to a mode of the log-likelihood. (There is now a new version of the lme routine (lme 3.0) that may perform better. S-Plus 5.0 for Windows includes lme 3.0; however, it is not expected to be included in UNIX/Linux S-Plus until Release 6.0.) Parameter-expanded ECME[sub 2] (and the other EM-type algorithms) exhibited no convergence difficulties.
5. DISCUSSION
The simulation results in Section 4 agree with theoretical arguments given elsewhere (e.g., Meng and van Dyk 1998). It is clear that parameter-expanded ECME[sub 2] is a general-purpose, reliable, and efficient algorithm for REML and ML calculations with mixed-effects models. For Bayesian calculations with a proper prior on the random-effects variance, there is unfortunately no known efficient parameter-expanded algorithm. Thus, we recommend using either ECME[sub 0] or ECME[sub 1]. The choice between these algorithms is based on the size of Delta[sup *], which can be approximated by replacing (Sigma[sup 2])[sup *] and T[sup *] with a priori values, with values computed from initial iterations of one of the ECME algorithms, or perhaps with REML estimates computed using parameter-expanded ECME[sub 2]. This final strategy is especially attractive when the prior distributions are very diffuse, since the REML estimates should also serve as very good starting values for ECME[sub 0] or ECME[sub 1].
Based on the simulations, these algorithms are comparable to commercially available software in terms of the computation time required for convergence (see also Meng and van Dyk 1998) but with important advantages. The EM-type algorithms are guaranteed to increase l(Theta|Y[sub obs]) at each iteration and to converge to estimates within the parameter space. Other methods (e.g., lme and proc mixed) require special monitoring and still may exhibit poor behavior. The lme routine, for example, was in general release for years before it was discovered that without special user intervention lme can converge to a point that is far from a mode. This was discovered by comparing lme with EM-type algorithms which converged correctly (Meng and van Dyk 1998). (This lme routine remains in the current version of S-Plus for UNIX/Linux.)
A third advantage of the EM-type algorithms is that they can be modified to handle more sophisticated models that cannot be fit using standard software. For example, missing values among {(Y[sub i], X[sub i], Z[sub i]), i = 1,..., m} can be accommodated using a model for the missing data and a revised (perhaps Monte Carlo) E-step. Certain generalized linear mixed models can also be fit using efficient new data-augmentation methods (e.g., parameter expansion and nesting; see van Dyk 2000). For example, probit hierarchical regression models can be viewed as an extension of (3.1) in which we observe only the sign of each component of Y[sub i], i = 1,..., m; the Y[sub i] themselves are considered to be missing data. We can also extend the model to a hierarchical t model by replacing the normal distributions for either or both of e[sub i] and b[sub i] with t distributions. This is accomplished by writing the t variable as the scaled ratio of a normal variable and the square root of an independent chi-square variable and treating the chi-square variable as missing data (e.g., Dempster, Laird, and Rubin 1977). Thus, by finding very efficient EM-type algorithms for the mixed-effects model, we add a building block for a variety of richer models.
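The scale-mixture representation of the t distribution mentioned above is easy to make concrete. This sketch (with hypothetical seed and degrees of freedom) draws a t[sub Nu] variable as a standard normal over the square root of an independent chi-square divided by its degrees of freedom, and checks the draws against the known variance Nu/(Nu - 2).

```python
import numpy as np

rng = np.random.default_rng(3)
nu, n = 10, 200_000

# t_nu as the scaled ratio of a normal and the square root of an
# independent chi-square -- the representation in which the chi-square
# variable is treated as missing data in the EM framework.
z = rng.standard_normal(n)
v = rng.chisquare(nu, size=n)
t = z / np.sqrt(v / nu)

# Sanity check: Var(t_nu) = nu / (nu - 2) for nu > 2.
print(t.var())   # approximately 1.25 for nu = 10
```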
The author's research was supported in part by NSF grant DMS 97-05157 and in part by the U.S. Census Bureau. The author thanks the referees and editors of the Journal of Computational and Graphical Statistics for their helpful comments and J. L. Foulley for several suggestions that have greatly improved the article, in particular the use of an arbitrary invertible working parameter matrix.
[Received August 1998. Revised June 1999.]
Table 1. Restricted log-likelihood at the point of convergence for three algorithms applied to one dataset generated with each value of Sigma[sup 2].

Sigma[sup 2]   lme (S-Plus)   proc mixed (SAS)   PX-ECME[sub 2]
    .25        -184.11        -184.06            -184.06
   1           -218.38        -218.39            -218.39
   4           -227.98          [B]              -227.98
   9           -252.93        -252.93            -252.93
  16           -269.75          [B]              -269.75
  25           -289.70[A]     -280.62            -280.62
  36           -301.14        -301.03[*]         -301.14
  49           -304.69          [B]              -304.69
  64           -321.41[A]     -319.85[*]         -320.64
  81           -327.40[A]       [B]              -325.44

NOTE: [A] converged to a point that is not a mode of the restricted log-likelihood; [B] did not converge; [*] converged to a value of T[sup *] that is not positive semidefinite.
GRAPHS: Figure 1. The posterior surface of the parameters in model (2.4) using an artificial dataset. The first plot is a contour plot of the joint posterior of (Sigma[sup 2], Tau[sup 2]); the second plot shows the conditional posterior of Tau[sup 2] given Sigma[sup 2] = 1. In both cases PX-EM converges to the global mode in one step, while the standard EM algorithm can be very slow to converge to a (local) mode.
GRAPHS: Figure 2. The time required by ECME[sub i] (Tau[sub i]), i = 0, 1, 2, and parameter-expanded ECME[sub i] (Tau[sub px-i]), i = 1, 2, to fit (4.1) to each of the 500 simulated datasets. While ECME[sub 0], ECME[sub 1], and ECME[sub 2] can be slow to converge for extreme values of Delta[sup *], the parameter-expanded ECME algorithms perform well for all values of Delta[sup *] in the simulation. Increasing the dimension of the working parameter further improves computational performance (e.g., parameter-expanded ECME[sub 2]). The horizontal line in each plot represents the approximate time required by the lme routine in S-Plus. It is clear that parameter-expanded ECME[sub 2] performs well relative to lme, with superior convergence properties (e.g., monotone convergence in log posterior).
GRAPHS: Figure 3. The relative time required by pairs of the three algorithms (ECME[sub 0], ECME[sub 2], and parameter-expanded ECME[sub 2]). The choice between ECME[sub 0] and ECME[sub 2] depends on Delta[sup *] while parameter-expanded ECME[sub 2] typically outperforms either algorithm.
Beaton, A. E. (1964), "The Use of Special Matrix Operations in Statistical Calculus," Education Testing Service Research Bulletin, RB-64-51.
Callanan, T. P., and Harville, D. A. (1991), "Some New Algorithms for Computing Restricted Maximum Likelihood Estimates of Variance Components," Journal of Statistical Computation and Simulation, 38, 239-259.
Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977), "Maximum Likelihood from Incomplete Data via the EM Algorithm" (with discussion), Journal of