\documentclass[12pt]{article} \usepackage{Sweave} %\SweaveOpts{eps=TRUE} %\usepackage[footnotesize]{caption} \usepackage{caption} \usepackage{amsmath} \usepackage{amsfonts} \usepackage{amscd} \usepackage{epsfig} \usepackage{fullpage} %\renewcommand{\baselinestretch}{1.5} \newcommand{\bm}[1]{\mbox{\boldmath $#1$}} \newcommand{\mb}[1]{\mathbf{#1}} \newcommand{\mc}[1]{\mathcal{#1}} \newcommand{\mr}[1]{\mathrm{#1}} \newcommand{\mbb}[1]{\mathbb{#1}} %\VignetteIndexEntry{new features in tgp version 2.x} %\VignetteKeywords{tgp2} %\VignetteDepends{tgp,maptree,MASS} %\VignettePackage{tgp} \begin{document} \setkeys{Gin}{width=0.85\textwidth} <>= library(tgp) options(width=65) @ \title{Categorical inputs, sensitivity analysis,\\ optimization and importance tempering\\ with {\tt tgp} version 2, an {\sf R} package for\\ treed Gaussian process models} \author{ Robert B. Gramacy\\ Department of Statistics\\ Virginia Tech\\ rbg@vt.edu \and Matt Taddy\\ Amazon\\ mataddy@amazon.com } \maketitle \begin{abstract} This document describes the new features in version 2.x of the {\tt tgp} package for {\sf R}, implementing treed Gaussian process (GP) models. The topics covered include methods for dealing with categorical inputs and excluding inputs from the tree or GP part of the model; fully Bayesian sensitivity analysis for inputs/covariates; %multiresolution (treed) Gaussian process modeling; sequential optimization of black-box functions; and a new Monte Carlo method for inference in multi-modal posterior distributions that combines simulated tempering and importance sampling. These additions extend the functionality of {\tt tgp} across all models in the hierarchy: from Bayesian linear models, to CART, to treed Gaussian processes with jumps to the limiting linear model. %, except in the case of multiresolution models which apply only %to the (treed) GP. 
It is assumed that the reader is familiar with the baseline functionality of the package, outlined in the first vignette \cite{gramacy:2007}. \end{abstract} \subsection*{Intended audience} \label{sec:discaimer} The {\tt tgp} package contains implementations of seven related Bayesian regression frameworks which combine treed partition models, linear models (LM), and stationary Gaussian process (GP) models. GPs are flexible (phenomenological) priors over functions which, when used for regression, are usually relegated to smaller applications for reasons of computational expense. Trees, by contrast, are a crude but efficient divide-and-conquer approach to non-stationary regression. When combined they are quite powerful, and provide a highly flexible nonparametric and non-stationary family of regression tools. These treed GP models have been successfully used in a variety of contexts, in particular in the sequential design and analysis of computer experiments. The models, and the (base) features of the package, are described in the vignette for version 1.x of the package \cite{gramacy:2007}. This document is intended as a follow-on, describing four new features that have been added to the package in version 2.x. As such, it is divided into four essentially disjoint sections: on categorical inputs (Section \ref{sec:cat}), sensitivity analysis (Section \ref{sec:sens}), statistical optimization (Section \ref{sec:optim}), and importance tempering (Section \ref{sec:it}). The ability to deal with categorical inputs greatly expands the sorts of regression problems which {\tt tgp} can handle. It also enables the partition component of the model to more parsimoniously describe relationships that were previously left to the GP part of the model, at a great computational expense and interpretational disadvantage.
The analysis of sensitivity to inputs via the predictive variance enables the user to inspect, and understand, the first-order and total effects of each of the inputs on the response. The section on statistical optimization expands the sequential design feature set described in the first vignette. We now provide a skeleton which automates the optimization of black-box functions by expected improvement, along with tools and suggestions for assessing convergence. Finally, the addition of tempering-based MCMC methods leads to more reliable inference via a more thorough exploration of the highly multi-modal posterior distributions that typically result from tree--based models, which previously could only be addressed by random restarts. Taken together, these four features have greatly expanded the capabilities of the package, and thus the variety of statistical problems which can be addressed with the {\tt tgp} family of methods. Each of the four sections to follow will begin with a short mathematical introduction to the new feature or methodology and continue with extensive examples in {\sf R} on synthetic and real data. This document has been authored in {\tt Sweave} (try {\tt help(Sweave)}). This means that the code quoted throughout is certified by {\sf R}, and the {\tt Stangle} command can be used to extract it. As with the first vignette, the {\sf R} code in each of the sections to follow is also available as a demo in the package. Note that this tutorial is not meant to serve as an instruction manual. For more detailed documentation of the functions contained in the package, see the package help manuals. At an {\sf R} prompt, type {\tt help(package=tgp)}. PDF documentation is also available on the web. \begin{center} \tt http://www.cran.r-project.org/doc/packages/tgp.pdf \end{center} Each section starts by seeding the random number generator with \verb!set.seed(0)!.
This is done to make the results and analyses reproducible within this document (assuming identical architecture [64-bit Linux] and version of {\sf R} [2.10.1]), and in demo form. We recommend you try these examples with different seeds and see what happens. Usually the results will be similar, but sometimes (especially when the data ({\tt X},{\tt Z}) is generated randomly) they may be quite different. \section{Non--real--valued, categorical and other inputs} \label{sec:cat} <>= seed <- 1; set.seed(seed) ## seed zero problematic with null btlm map tree below @ Early versions of {\tt tgp} worked best with real--valued inputs $\mb{X}$. While it was possible to specify ordinal, integer--valued, or even binary inputs, {\tt tgp} would treat them the same as any other real--valued input. Two new arguments to {\tt tgp.default.params}, and thus the ellipses ({\tt ...}) argument to the {\tt b*} functions, provide a more natural way to model with non--real--valued inputs. In this section we shall introduce these extensions, and thereby illustrate how the current version of the package can more gracefully handle categorical inputs. We argue that the careful application of this new feature can lead to reductions in computational demands, improved exploration of the posterior, increased predictive accuracy, and more transparent interpretation of the effects of categorical inputs. Classical treed methods, such as CART \cite{brei:1984}, can cope quite naturally with categorical, binary, and ordinal inputs. Categorical inputs can be encoded in binary, and splits can be proposed with rules such as $x_i < 1$. Once a split is made on a binary input, no further processing is needed, marginally, in that dimension. Ordinal inputs can also be coded in binary, and thus treated as categorical, or treated as real--valued and handled in a default way.
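To make such a binary encoding concrete, here is a small base {\sf R} sketch; it is our own illustration, not a {\tt tgp} facility, and the lookup table {\tt code} is a hypothetical helper:

```r
## sketch: binary encoding of a 4-level categorical input; the
## lookup table `code' is our own construction, not part of tgp
code <- rbind(c(0,0,0), c(0,0,1), c(0,1,0), c(1,0,0))
I <- c(1, 4, 2, 2, 3)       ## categorical observations, levels 1:4
B <- code[I, , drop=FALSE]  ## one row of indicator bits per datum
B
## a CART-style split on a binary column, e.g. B[,1] < 1, isolates
## level 4; no further processing is needed in that dimension
```

Under a coding like this, a single split rule on a binary column suffices to separate the corresponding level from the rest.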
GP regression, however, handles such non--real--valued inputs less naturally, unless (perhaps) a custom and non--standard form of the covariance function is used \cite{qian:wu:wu:2009}. When inputs are scaled to lie in $[0,1]$, binary--valued inputs $x_i$ are always a constant distance apart---at the largest possible distance in the range. A separable correlation function width parameter $d_i$ will tend to infinity (in the posterior) if the output does not vary with $x_i$, and will tend to zero if it does. Clearly, this functionality is more parsimoniously achieved by partitioning, e.g., using a tree. However, trees with fancy regression models at the leaves pose other problems, as discussed below. Consider, as motivation, the following modification of the Friedman data \cite{freid:1991} (see also Section 3.5 of \cite{gramacy:2007}). Augment 10 real--valued covariates in the data ($\mb{x} = \{x_1,x_2,\dots,x_{10}\}$) with one categorical indicator $I\in\{1,2,3,4\}$ that can be encoded in binary as \begin{align*} 1& \equiv (0,0,0) & 2 &\equiv (0,0,1) & 3 &\equiv (0,1,0) & 4 &\equiv (1,0,0). \end{align*} Now let the function that describes the responses ($Z$), observed with standard Normal noise, have a mean \begin{equation} E(Z|\mb{x}, I) = \left\{ \begin{array}{cl} 10 \sin(\pi x_1 x_2) & \mbox{if } I = 1 \\ 20(x_3 - 0.5)^2 &\mbox{if } I = 2 \\ 10x_4 + 5 x_5 &\mbox{if } I = 3 \\ 5 x_1 + 10 x_2 + 20(x_3 - 0.5)^2 + 10 \sin(\pi x_4 x_5) &\mbox{if } I = 4 \label{eq:f1b} \end{array} \right. \end{equation} that depends on the indicator $I$. Notice that when $I=4$ the original Friedman data is recovered, but with the first five inputs in reverse order. Irrespective of $I$, the response depends only on $\{x_1,\dots,x_5\}$, thus combining nonlinear, linear, and irrelevant effects. When $I=3$ the response is linear in $\mb{x}$. A new function has been included in the {\tt tgp} package which facilitates generating random realizations from (\ref{eq:f1b}).
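For reference, the mean surface in (\ref{eq:f1b}) is easy to code up directly; the function below is our own sketch (the package itself supplies the generator, used next, which adds the binary encoding and the Normal noise):

```r
## our own sketch of the mean in Eq. (1b); tgp instead supplies
## fried.bool(), which draws noisy realizations from this surface
fried.mean <- function(x, I) {
  switch(I,
         10*sin(pi*x[1]*x[2]),                 ## I = 1
         20*(x[3] - 0.5)^2,                    ## I = 2
         10*x[4] + 5*x[5],                     ## I = 3
         5*x[1] + 10*x[2] + 20*(x[3] - 0.5)^2 +
           10*sin(pi*x[4]*x[5]))               ## I = 4
}
fried.mean(rep(0.5, 10), 3)  ## linear case: 10(0.5) + 5(0.5) = 7.5
```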
Below we obtain 500 such random realizations for training purposes, and a further 1000 for testing. <<>>= fb.train <- fried.bool(500) X <- fb.train[,1:13]; Z <- fb.train$Y fb.test <- fried.bool(1000) XX <- fb.test[,1:13]; ZZ <- fb.test$Ytrue @ A separation into training and testing sets will be useful for later comparisons by RMSE. The names of the data frame show that the first ten columns encode $\mb{x}$ and columns 11--13 encode the boolean representation of $I$. <<>>= names(X) @ One na\"ive approach to fitting this data would be to fit a treed GP LLM model ignoring the categorical inputs. But this model can only account for the noise, giving high RMSE, and so is not illustrated here. Clearly, the indicators must be included. One simple way to do so would be to posit a Bayesian CART model. <<>>= fit1 <- bcart(X=X, Z=Z, XX=XX, verb=0) rmse1 <- sqrt(mean((fit1$ZZ.mean - ZZ)^2)) rmse1 @ In this case the indicators are treated appropriately (as indicators), but in some sense so are the real--valued inputs, as only constant models are fit at the leaves of the tree. \begin{figure}[ht!] <>= tgp.trees(fit1, "map") @ <>= graphics.off() @ \centering \includegraphics[trim=0 100 0 25]{tgp2-cat-fbcart-mapt} \caption{Diagrammatic depiction of the maximum {\em a posteriori} (MAP) tree for the boolean indicator version of the Friedman data in Eq.~(\ref{eq:f1b}) using Bayesian CART.} \label{f:fb:cart} \end{figure} Figure \ref{f:fb:cart} shows that the tree does indeed partition on the indicators, and the other inputs, as expected. One might expect a much better fit from a treed linear model to this data, since the response is linear in some of its inputs. <<>>= fit2 <- btlm(X=X, Z=Z, XX=XX, verb=0) rmse2 <- sqrt(mean((fit2$ZZ.mean - ZZ)^2)) rmse2 @ Unfortunately, this is not the case---the RMSE obtained is similar to the one for the CART model. \begin{figure}[ht!]
<>= tgp.trees(fit2, "map") @ <>= graphics.off() @ \centering \includegraphics[trim=0 100 0 25]{tgp2-cat-fbtlm-mapt} \caption{Diagrammatic depiction of the maximum {\em a posteriori} (MAP) tree for the boolean indicator version of the Friedman data in Eq.~(\ref{eq:f1b}) using a Bayesian treed linear model.} \label{f:fb:btlm:trees} \end{figure} Figure \ref{f:fb:btlm:trees} shows that the tree does indeed partition, but not on the indicator variables. When a linear model is used at the leaves of the tree the boolean indicators cannot be partitioned upon because doing so would cause the design matrix to become rank--deficient at the leaves of the tree (there would be a column of all zeros or all ones). A treed GP would have the same problem. A new feature in {\tt tgp} makes dealing with indicators such as these more natural, by including them as candidates for treed partitioning, but ignoring them when it comes to fitting the models at the leaves of the tree. The argument {\tt basemax} to {\tt tgp.default.params}, and thus the ellipses ({\tt ...}) argument to the {\tt b*} functions, allows for the specification of the last columns of {\tt X} to be considered under the base (LM or GP) model. In the context of our example, specifying {\tt basemax = 10} ensures that only the first 10 inputs, i.e., $\mb{X}$ only (excluding $I$), are used to predict the response under the GPs at the leaves. Both the columns of $\mb{X}$ and the columns of the boolean representation of the (categorical) indicators $I$ are (still) candidates for partitioning. This way, whenever the boolean indicators are partitioned upon, the design matrix (for the GP or LM) will not contain the corresponding column of zeros or ones, and therefore will be of full rank. Let us revisit the treed LM model with {\tt basemax = 10}. <<>>= fit3 <- btlm(X=X, Z=Z, XX=XX, basemax=10, verb=0) rmse3 <- sqrt(mean((fit3$ZZ.mean - ZZ)^2)) rmse3 @ \begin{figure}[ht!]
<>= tgp.trees(fit3, "map") @ <>= graphics.off() @ \centering \includegraphics[trim=0 90 0 25,scale=0.75]{tgp2-cat-fbtlm-mapt} \caption{Diagrammatic depiction of the maximum {\em a posteriori} (MAP) tree for the boolean indicator version of the Friedman data in Eq.~(\ref{eq:f1b}) using a Bayesian treed linear model with the setting {\tt basemax = 10}.} \label{f:fb:btlm:mapt} \end{figure} Figure \ref{f:fb:btlm:mapt} shows that the MAP tree does indeed partition on the indicators in an appropriate way---as well as on some other real--valued inputs---and the result is the lower RMSE we would expect. A higher--powered approach would clearly be to treat all inputs as real--valued by fitting a GP at the leaves of the tree. Binary partitions are allowed on all inputs, $\mb{X}$ and $I$, but treating the boolean indicators as real--valued in the GP is clearly inappropriate since it is known that the process does not vary smoothly over the $0$ and $1$ settings of the three boolean indicators representing the categorical input $I$. <<>>= fit4 <- btgpllm(X=X, Z=Z, XX=XX, verb=0) rmse4 <- sqrt(mean((fit4$ZZ.mean - ZZ)^2)) rmse4 @ Since the design matrices would become rank--deficient if the boolean indicators were partitioned upon, there was no partitioning in this example. <<>>= fit4$gpcs @ Since there are large covariance matrices to invert, the MCMC inference is {\em very} slow. Still, the resulting fit (obtained with much patience) is better than the Bayesian CART and treed LM (with {\tt basemax = 10}) ones, as indicated by the RMSE. We would expect to get the best of both worlds if the setting {\tt basemax = 10} were used when fitting the treed GP model, thus allowing partitioning on the indicators while guarding against rank--deficient design matrices. <<>>= fit5 <- btgpllm(X=X, Z=Z, XX=XX, basemax=10, verb=0) rmse5 <- sqrt(mean((fit5$ZZ.mean - ZZ)^2)) rmse5 @ And indeed this is the case. The benefits go beyond producing full rank design matrices at the leaves of the tree.
Loosely speaking, removing the boolean indicators from the GP part of the treed GP gives a more parsimonious model, without sacrificing any flexibility. The tree is able to capture all of the dependence in the response as a function of the indicator input, and the GP is the appropriate non--linear model for accounting for the remaining relationship between the real--valued inputs and outputs. \begin{figure}[ht!] <>= tgp.trees(fit5, "map") @ <>= graphics.off() @ \centering \includegraphics[trim=0 100 0 25]{tgp2-cat-fb-mapt} \caption{Diagrammatic depiction of the maximum {\em a posteriori} (MAP) tree for the boolean indicator version of the Friedman data in Eq.~(\ref{eq:f1b}) using {\tt basemax=10}.} \label{f:fb:mapt} \end{figure} We can look at the maximum {\em a posteriori} (MAP) tree, to see that only (and all of) the indicators were partitioned upon in Figure \ref{f:fb:mapt}. Further advantages to this approach include speed (a partitioned model gives smaller covariance matrices to invert) and improved mixing in the Markov chain when a separable covariance function is used. Note that using a non--separable covariance function in the presence of indicators would result in a poor fit. Good range ($d$) settings for the indicators would not necessarily coincide with good range settings for the real--valued inputs. A complementary setting, {\tt splitmin}, allows the user to specify the first column of the inputs {\tt X} upon which treed partitioning is allowed. From Section 3.5 of the first {\tt tgp} vignette \cite{gramacy:2007}, it was concluded that the original formulation of the Friedman data was stationary, and thus treed partitioning is not required to obtain a good fit. The same would be true of the response in (\ref{eq:f1b}) after conditioning on the indicators.
Therefore, the most parsimonious model would use {\tt splitmin = 11}, in addition to {\tt basemax = 10}, so that only $\mb{X}$ are under the GP, and only $I$ under the tree. Fewer viable candidate inputs for treed partitioning should yield improved mixing in the Markov chain, and thus lower RMSE. <<>>= fit6 <- btgpllm(X=X, Z=Z, XX=XX, basemax=10, splitmin=11, verb=0) rmse6 <- sqrt(mean((fit6$ZZ.mean - ZZ)^2)) rmse6 @ Needless to say, it is important that the input {\tt X} have columns which are ordered appropriately before the {\tt basemax} and {\tt splitmin} arguments can be properly applied. Future versions of {\tt tgp} will have a formula--based interface to handle categorical ({\tt factors}) and other inputs more like other {\sf R} regression routines, e.g., {\tt lm} and {\tt glm}. The tree and binary encodings represent a particularly thrifty way to handle categorical inputs in a GP regression framework, however it is by no means the only or best approach to doing so. A disadvantage to the binary coding is that it causes the introduction of several new variables for each categorical input. Although they only enter the tree part of the model, and not the GP (where the introduction of many new variables could cause serious problems), this may still be prohibitive if the number of categories is large. Another approach that may be worth considering in this case involves designing a GP correlation function which can explicitly handle a mixture of qualitative (categorical) and quantitative (real-valued) factors \cite{qian:wu:wu:2009}. An advantage of our treed approach is that it is straightforward to inspect the effect of the categorical inputs by, e.g., counting the number of trees (in the posterior) which contain a particular binary encoding. It is also easy to see how the categorical inputs interact with the real-valued ones by inspecting the (posterior) parameterizations of the correlation parameters in each partition on a binary encoding. 
Both of these are naturally facilitated by gathering traces ({\tt trace = TRUE}), as described in the 1.x vignette \cite{gramacy:2007}. In Section \ref{sec:sens} we discuss a third way of determining the sensitivity of the response to categorical and other inputs. \section{Analysis of sensitivity to inputs} \label{sec:sens} <>= seed <- 0; set.seed(seed) @ Methods for understanding how inputs, or explanatory variables, contribute to the outputs, or response, of simple statistical models are by now classic in the literature and frequently used in practical application. For example, in linear regression one can perform $F$--tests to ascertain the relevance of a predictor, inspect the leverage of a particular input setting, or use Cook's distance, to name a few. Unfortunately, such convenient statistics/methods are not available for more complicated models, such as those in the {\tt tgp} family of nonparametric models. A more advanced tool is needed. Sensitivity analysis (SA) resolves the sources of output variability by apportioning elements of this variation to different sets of input variables. It is applicable in wide generality. The edited volume by Saltelli et al.~\cite{SaltChanScot2000} provides an overview of the field. Valuable recent work on smoothing methods is found in \cite{StorHelt2008,VeigWahlGamb2009}, and Storlie et al.~\cite{StorSwilHeltSall2009} provide a nice overview of nonparametric regression methods for inference about sensitivity. The analysis of response variability is useful in a variety of different settings. For example, when there is a large number of input variables over which an objective function is to be optimized, typically only a small subset will be influential within the confines of their uncertainty distribution. SA can be used to reduce the input space of such optimizations \cite{TaddLeeGrayGrif2009}.
Other authors have used SA to assess the risk associated with dynamic factors affecting the storage of nuclear waste \cite{HommSalt1996}, and to investigate the uncertainty characteristics of a remote sensing model for the reflection of light by surface vegetation \cite{MorrKottTaddFurfGana2008}. The {\tt sens} function adds to {\tt tgp} a suite of tools for global sensitivity analysis, and enables ``out-of-the-box'' estimation of valuable sensitivity indices for any regression relationship that may be modeled by a member of the {\tt tgp} family. The type of sensitivity analysis provided by {\tt tgp} falls within the paradigm of global sensitivity analysis, wherein the variability of the response is investigated with respect to a probability distribution over the entire input space. The recent book by Saltelli et al. \cite{SaltEtAl2008} serves as a primer on this field. Global SA is inherently a problem of statistical inference, as evidenced by the interpolation and estimation required in a study of the full range of inputs. This is in contrast with the analytical nature of local SA, which involves derivative--based investigation of the stability of the response over a small region of inputs. We will ignore local SA for the remainder of this document. The sensitivity of a response $z$ to a changing input $\mb{x}$ is always considered in relation to a specified {\it uncertainty distribution}, defined by the density $u(\mb{x})$, and the appropriate marginal densities $u_i(x_i)$. What is represented by the uncertainty distribution changes depending upon the context. In the canonical setup, $z$ is the response from a complicated physics or engineering simulation model, with tuning parameters $\mb{x}$, that is used to predict physical phenomena. In this situation, $u(\mb{x})$ represents the experimentalist's uncertainty about real--world values of $\mb{x}$.
In optimization problems, the uncertainty distribution can be used to express prior information from experimentalists or modelers on where to look for solutions. Finally, in the case of observational systems (such as air-quality or smog levels), $u(\mb{x})$ may be an estimate of the density governing the natural occurrence of the $\mb{x}$ factors (e.g., air-pressure, temperature, wind, and cloud cover). In this setup, SA attempts to resolve the natural variability of $z$. The most common notion of sensitivity is tied to the relationship between conditional and marginal variance for $z$. Specifically, variance--based methods decompose the variance of the objective function, with respect to the uncertainty distribution on the inputs, into variances of conditional expectations. These are a natural measure of the output association with specific sets of variables and provide a basis upon which the importance of individual inputs may be judged. The other common component of global SA is an accounting of the main effects for each input variable, $\mathbb{E}_{u_j}[z|x_j]$, which can be obtained as a by-product of the variance analysis. Our variance--based approach to SA is a version of the method of Sobol', wherein a deterministic objective function is decomposed into summands of functions on lower dimensional subsets of the input space. Consider the function decomposition $ f(x_1, \ldots ,x_d) = f_0 + \sum_{j=1}^df_j(x_j) +\sum_{1 \leq i < j \leq d} f_{ij}(x_i,x_j) + \ldots + f_{1,\ldots,d}(x_1, \ldots ,x_d). $ When the response $f$ is modeled as a stochastic process $z$ conditional on inputs $\mb{x}$, we can develop a similar decomposition into the response distributions which arise when $z$ has been marginalized over one subset of covariates and the complement of this subset is allowed to vary according to a marginalized uncertainty distribution.
In particular, we can obtain the marginal conditional expectation $\mbb{E}[z|\mb{x}_J=\{x_j:j\in J\}]$ $=$ $\int_{\mathbb{R}^{d-d_J}} \mbb{E}[z|\mb{x}]u_{-J}(\mb{x}_{-J}) d\mb{x}_{-J}$, where $J=\{j_1, \ldots, j_{d_J}\}$ indicates a subset of input variables, $\mb{x}_{-J} =\{x_j:j\notin J\}$, and the marginal uncertainty density is given by $u_J(\mb{x}_J) = \int_{\mathbb{R}^{d-d_J}} u(\mb{x}) d\mb{x}_{-J}$. SA concerns the variability of $\mbb{E}[z|\mb{x}_J]$ with respect to changes in $\mb{x}_J$ according to $u_J(\mb{x}_J)$ and, if $u$ is such that the inputs are uncorrelated, the variance decomposition is available as \begin{equation} \label{eqn:var_decomp} \mr{var}(\mbb{E}[z|\mb{x}]) = \sum_{j=1}^dV_j + \sum_{1 \leq i < j \leq d} V_{ij} + \ldots + V_{1,\ldots,d}, \end{equation} where $V_j = \mr{var}(\mbb{E}[z|x_j])$, $V_{ij}=\mr{var}(\mbb{E}[z|x_i, x_j]) - V_i - V_j$, and so on. Clearly, when the inputs are correlated this identity no longer holds (although a ``less-than-or-equal-to'' inequality is always true). But it is useful to retain an intuitive interpretation of the $V_J$'s as a portion of the overall marginal variance. Our global SA will focus on the related sensitivity indices $S_J = V_J/\mr{var}(z)$ which, as can be seen in the above equation, will sum to one over all possible $J$ and are bounded to $[0,1]$. These $S_J$'s provide a natural measure of the {\it importance} of a set $J$ of inputs and serve as the basis for an elegant analysis of sensitivity. The {\tt sens} function allows for easy calculation of two very important sensitivity indices associated with each input: the 1$^{\rm st}$ order index for the $j$th input variable, \begin{equation} S_j = \frac{\mr{var}\left(\mbb{E}\left[z|x_j\right]\right)}{\mr{var}(z)}, \label{eq:S} \end{equation} and the total sensitivity for input $j$, \begin{equation} T_j = \label{eq:T} \frac{\mbb{E}\left[\mr{var}\left(z|\mb{x}_{-j}\right)\right]}{\mr{var}(z)}.
\end{equation} The 1$^{\rm st}$ order indices measure the portion of variability that is due to variation in the main effects for each input variable, while the total effect indices measure the portion of variability that is due to total variation in each input. From the identity $\mbb{E}\left[\mr{var}\left(z|\mb{x}_{-j}\right)\right] = \mr{var}(z) - \mr{var}\left(\mbb{E}\left[z|\mb{x}_{-j}\right]\right)$, it can be seen that $T_j$ measures the {\it residual} variability remaining after variability in all other inputs has been apportioned and that, for a deterministic response and uncorrelated input variables, $T_j = \sum_{J:j \in J} S_J$. This implies that the difference between $T_j$ and $S_j$ provides a measure of the variability in $z$ due to interaction between input $j$ and the other input variables. A large difference may lead the investigator to consider other sensitivity indices to determine where this interaction is most influential, and this is often a key aspect of the dimension--reduction that SA provides for optimization problems. \subsection{Monte Carlo integration for sensitivity indices} Due to the many integrals involved, estimation of the sensitivity indices is not straightforward. The influential paper by Oakley \& O'Hagan \cite{OaklOhag2004} describes an empirical Bayes estimation procedure for the sensitivity indices; however, some variability in the indices is lost due to plug-in estimation of GP model parameters and, more worryingly, the variance ratios are only possible in the form of a ratio of expected values. Marrel et al.~\cite{MarrIoosLaurRous2009} provide a more complete analysis of the GP approach to this problem, but their methods remain restricted to estimation of the first order Sobol indices. Likelihood--based approaches have also been proposed \cite{WelcBuckSackWynnMitcMorr1992,MorrKottTaddFurfGana2008}.
The technique implemented in {\tt tgp} is, in contrast, fully Bayesian and provides a complete accounting of the uncertainty involved. Briefly, at each iteration of an MCMC chain sampling from the treed GP posterior, output is predicted over a large (carefully chosen) set of input locations. Conditional on this predicted output, the sensitivity indices can be calculated via Monte Carlo integration. By conditioning on the predicted response (and working as though it were the observed response), we obtain a posterior sample of the indices, incorporating variability from both the integral estimation and uncertainty about the function output. In particular, the {\tt sens} function includes a {\tt model} argument which allows for SA based on any of the prediction models (the {\tt b*} functions) in {\tt tgp}. Our Monte Carlo integration is based upon Saltelli's \cite{Salt2002} efficient Latin hypercube sampling (LHS) scheme for estimation of both 1$^{\rm st}$ order and total effect indices. We note that the estimation is only valid for uncorrelated inputs, such that $u(\mb{x}) = \prod_{j=1}^d u_j(x_j)$. The {\tt sens} function only allows for uncertainty distributions of this type (in fact, the marginal distributions also need to be bounded), but this is a feature of nearly every ``out-of-the-box'' approach to SA. Studies which concern correlated inputs will inevitably require modeling for this correlation, whereas most regression models (including those in {\tt tgp}) condition on the inputs and ignore the joint density for $\mb{x}$. Refer to the work of Saltelli \& Tarantola \cite{SaltTara2002} for an example of SA with correlated inputs. We now briefly describe the integration scheme. The 2nd moment is a useful intermediate quantity in variance estimation, and we define \[ D_J = \mbb{E}\left[\mbb{E}^2\left[z|\mb{x}_{J}\right]\right] = \int_{\mbb{R}^{d_J}} \mbb{E}^2\left[z| {\mb{x}_J}\right]u_J(\mb{x}_J)d(\mb{x}_J). 
\] Making use of an auxiliary variable, \begin{eqnarray*} D_J &=& \int_{\mbb{R}^{d_J}} \left[\int_{\mbb{R}^{d_{-J}}} \!\!\!\mbb{E}\left[ z | \mb{x}_J, \mb{x}_{-J} \right]u_{-J}(\mb{x}_{-J})d\mb{x}_{-J} \int_{\mbb{R}^{d_{-J}}} \!\!\!\mbb{E}\left[ z | \mb{x}_J, \mb{x}'_{-J} \right] u_{-J}(\mb{x}'_{-J})d\mb{x}'_{-J}\right]u_J(\mb{x}_J)d\mb{x}_{J}\\ &=& \int_{\mbb{R}^{d + d_{-J}}} \!\!\mbb{E}\left[ z | \mb{x}_J, \mb{x}_{-J} \right]\mbb{E}\left[ z | \mb{x}_J, \mb{x}'_{-J} \right] u_{-J}(\mb{x}_{-J})u_{-J}(\mb{x}'_{-J})u_{J}(\mb{x}_{J})d\mb{x}\,d\mb{x}'_{-J}. \end{eqnarray*} Thus, in the case of independent inputs, \[ D_J = \int_{\mbb{R}^{d+d_{-J}}} \mbb{E}\left[ z |\mb{x} \right]\mbb{E}\left[ z | \mb{x}_J, \mb{x}'_{-J} \right] u_{-J}(\mb{x}'_{-J})u(\mb{x})d\mb{x}'_{-J}d\mb{x}. \] Note that at this point, if the inputs had been correlated, the integral would have been instead with respect to the joint density $u(\mb{x})u(\mb{x}_{-J}' | \mb{x}_J)$, leading to a more difficult integral estimation problem. Recognizing that $S_j = (D_j-\mbb{E}^2[z])/\mr{var}(z)$ and $T_j = 1 - \left(D_{-j} - \mbb{E}^2[z]\right)/\mr{var}(z)$, we need estimates of $\mr{var}(z)$, $\mbb{E}^2[z]$, and $\{ (D_j, D_{-j}) : j=1,\ldots,d \}$ to calculate the sensitivity indices. Given an LHS $M$ proportional to $u(\mb{x})$, \begin{equation*} M = \left[ \begin{array}{c} s_{1_1} ~ \cdots ~ s_{1_d}\\ \vdots \\ s_{m_1} ~ \cdots ~ s_{m_d}\\ \end{array} \right], \end{equation*} it is possible to estimate $\widehat{\mbb{E}[z]} = \frac{1}{m} \sum_{k=1}^m\mbb{E}[z|{\bf s}_k]$ and $\widehat{\mr{var}[z]} = \frac{1}{m} \mbb{E}^T[z|M]\mbb{E}[z|M] - \widehat{\mbb{E}[z]}\widehat{\mbb{E}[z]}$, where the convenient notation $\mbb{E}[z|M]$ is taken to mean $\left[\mbb{E}[z|\mb{s}_1] \cdots \mbb{E}[z|\mb{s}_m]\right]^T$. All that remains is to estimate the $D$'s.
Define a second LHS $M'$ proportional to $u$, of the same size as $M$, and say that $N_J$ is $M'$ with columns $J$ replaced by the corresponding columns of $M$. Hence, \begin{equation*} N_j = \left[ \begin{array}{c} s'_{1_1} \cdots s_{1_j} \cdots s'_{1_d}\\ \vdots \\ s'_{m_1} \cdots s_{m_j} \cdots s'_{m_d} \end{array}\right]~~~\mr{and}~~~ N_{-j} = \left[ \begin{array}{c} s_{1_1} \cdots s'_{1_j} \cdots s_{1_d}\\ \vdots \\ s_{m_1} \cdots s'_{m_j} \cdots s_{m_d} \end{array}\right]. \end{equation*} The estimates are then $\hat D_j = \mbb{E}^T[z|M]\mbb{E}[z|N_{j}]/(m-1)$ and $\hat D_{-j}$ $=$ $\mbb{E}^T[z|M']\mbb{E}[z|N_{j}]/(m-1)$ $\approx$ $ \mbb{E}^T[z|M]\mbb{E}[z|N_{-j}]/(m-1)$. Along with the variance and expectation estimates, these can be plugged into the equations for $S_j$ and $T_j$ in (\ref{eq:S}--\ref{eq:T}) to obtain $\hat S_j$ and $\hat T_j$. Note that Saltelli recommends the use of the alternative estimate $\widehat{\mbb{E}^2[z]} = \frac{1}{m-1}\mbb{E}^T[z|M]\mbb{E}[z|M']$ in calculating 1$^{\rm st}$ order indices, as this brings the index closer to zero for non-influential variables. However, it has been our experience that these biased estimates can be unstable, and so {\tt tgp} uses the standard $\widehat{\mbb{E}^2[z]} = \widehat{\mbb{E}[z]}\widehat{\mbb{E}[z]}$ throughout. As a final point, we note that identical MCMC sampling-based integration schemes can be used to estimate other Sobol indices (e.g., second order) for particular combinations of inputs, but this would require customization of the {\tt tgp} software. The set of input locations which need to be evaluated for each calculation of the indices is $\{ M, M', N_1,\ldots,N_d \}$, and if $m$ is the sample size for the Monte Carlo estimate, this scheme requires $m(d+2)$ function evaluations. Hence, at each MCMC iteration of the model fitting, the $m(d+2)$ locations are drawn randomly according to the LHS scheme, creating a random prediction matrix, {\tt XX}.
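The estimators above can be sketched in a few lines of {\sf R}. The following toy implementation is {\em not} part of the {\tt tgp} package: it uses i.i.d.\ uniform draws in place of proper LHSs, and a known, cheap function {\tt fun} (taking a matrix of inputs, one row per location) standing in for the posterior predictive mean $\mbb{E}[z|\cdot]$; all names are hypothetical.

<<eval=FALSE>>=
## toy version of Saltelli's scheme: fun plays the role of E[z|.]
saltelli <- function(fun, d, m=1000)
  {
    M <- matrix(runif(m*d), ncol=d)       ## stand-in for the LHS M
    Mprime <- matrix(runif(m*d), ncol=d)  ## second sample, M'
    EzM <- fun(M); EzMp <- fun(Mprime)
    Ez <- mean(EzM)                       ## estimate of E[z]
    Vz <- mean(EzM^2) - Ez^2              ## estimate of var(z)
    S <- Tj <- rep(NA, d)
    for(j in 1:d) {
      Nj <- Mprime; Nj[,j] <- M[,j]       ## M' with column j from M
      EzNj <- fun(Nj)
      Dj <- sum(EzM * EzNj)/(m-1)         ## for the 1st order index
      Dmj <- sum(EzMp * EzNj)/(m-1)       ## for the total effect index
      S[j] <- (Dj - Ez^2)/Vz
      Tj[j] <- 1 - (Dmj - Ez^2)/Vz
    }
    list(S=S, T=Tj)
  }
@

For an additive function such as {\tt fun <- function(X) X[,1] + 2*X[,2]} the first order and total effect estimates should (approximately) agree, with the second input accounting for four times the variance of the first.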
Because the input locations are drawn randomly at each iteration, the Monte Carlo error of the integral estimates will be included in the posterior variability of the indices, and the posterior moments will not be dependent upon any single estimation input set. Using predicted output over this input set, a single realization of the sensitivity indices is calculated through Saltelli's scheme. At the conclusion of the MCMC, we have a representative sample from the posterior for ${\bf S}$ and ${\bf T}$. The averages for these samples are unbiased estimates of the posterior mean, and the variability of the sample is representative of the complete uncertainty about model sensitivity. Since a subset of the predictive locations ($M$ and $M'$) is actually an LHS proportional to the uncertainty distribution, we can also estimate the main effects at little extra computational cost. At each MCMC iteration, a one--dimensional nonparametric regression is fit through the scatterplot of $[s_{1_j}, \ldots, s_{m_j},s'_{1_j}, \ldots, s'_{m_j}]$ vs. $[\mbb{E}[z|M],\mbb{E}[z|M']]$ for each of the $j=1,\ldots,d$ input variables. The resultant regression estimate provides a realization of $\mbb{E}[z|x_j]$ over a grid of $x_j$ values, and therefore a posterior draw of the main effect curve. Thus, at the end of the MCMC, we have not only unbiased estimates of the main effects through posterior expectation, but also a full accounting of our uncertainty about the main effect curve. This technique is not very sensitive to the method of non-parametric regression, since $2m$ will typically represent a very large sample in one dimension. The estimation in {\tt tgp} uses a moving average with squared distance weights and a window containing the {\tt span}$*2m$ nearest points (the {\tt span} argument defaults to 0.3). \subsection{Examples} We illustrate the capabilities of the {\tt sens} function by looking at the Friedman function considered earlier in this vignette.
The function that describes the responses ($Z$), observed with standard Normal noise, has mean \begin{equation} E(Z|\mb{x}) = 10 \sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5 x_5. \label{eq:f1} \end{equation} A sensitivity analysis can be based upon any of the available regression models (e.g., {\tt btlm}, {\tt bgp}, or {\tt btgp}); we choose to specify {\tt model=bgpllm} for this example. The size of each LHS used in the integration scheme is specified through {\tt nn.lhs}, which is equivalent to $m$ in the above algorithm description. Thus the number of locations used for prediction---the size of the random {\tt XX} prediction matrix---is {\tt nn.lhs*(ncol(X)+2)}. In addition, the window for moving average estimation of the main effects is {\tt span*2*nn.lhs} (independent of this, an {\tt ngrid} argument with a default setting of {\tt ngrid=100} dictates the number of grid points in each input dimension upon which main effects will be estimated). <<>>= f <- friedman.1.data(250) @ This function actually generates 10 covariates, the last five of which are completely uninfluential. We'll include one of these ($x_6$) to show what the sensitivity analysis looks like for unrelated variables. <<>>= Xf <- f[, 1:6] Zf <- f$Y sf <- sens(X=Xf, Z=Zf, nn.lhs=600, model=bgpllm, verb=0) @ The progress indicators printed to the screen (for {\tt verb > 0}) are the same as would be obtained under the specified regression {\tt model}---{\tt bgpllm} in this case---so we suppress them here. All of the same options (e.g., {\tt BTE}, {\tt R}, etc.) apply, although if using the {\tt trace} capabilities one should be aware that the {\tt XX} matrix is changing throughout the MCMC. The {\tt sens} function returns a \verb!"tgp"!-class object, and all of the SA related material is included in the {\tt sens} list within this object.
<<>>= names(sf$sens) @ The object provides the SA parameters ({\tt par}), the grid of locations for main effect prediction ({\tt Xgrid}), the mean and interval estimates for these main effects ({\tt ZZ.mean}, {\tt ZZ.q1}, and {\tt ZZ.q2}), and the full posterior via samples of the sensitivity indices ({\tt S} and {\tt T}). The plot function for \verb!"tgp"!-class objects now provides a variety of ways to visualize the results of a sensitivity analysis. This capability is accessed by specifying {\tt layout="sens"} in the standard {\tt plot} command. By default, the mean posterior main effects are plotted next to boxplot summaries of the posterior sample for each $S_j$ and $T_j$ index, as in Figure \ref{full}. \begin{figure}[ht!] <>= plot(sf, layout="sens", legendloc="topleft") @ <>= graphics.off() @ \includegraphics[width=6.5in,trim=0 10 0 10]{tgp2-sens-full} \caption{Full sensitivity analysis results for the Friedman function.} \label{full} \end{figure} A further note on the role played by {\tt nn.lhs}: As always, the quality of the regression model estimate depends on the length of the MCMC. But now, the quality of the sensitivity analysis is directly influenced by the size of the LHS used for integral approximation; as with any Monte Carlo integration scheme, the sample size (i.e., {\tt nn.lhs}) must increase with the dimensionality of the problem. In particular, it can be seen in the estimation procedure described above that the total sensitivity indices (the $T_j$'s) are not forced to be non-negative. If negative values occur, it is necessary to increase {\tt nn.lhs}. In any case, the {\tt plot.tgp} function changes any of the negative values to zero for purposes of illustration. The {\tt maineff} argument can be used to plot either selected main effects (Figure \ref{mains}), \begin{figure}[ht!]
<>= par(mar=c(4,2,4,2), mfrow=c(2,3)) plot(sf, layout="sens", maineff=t(1:6)) @ <>= graphics.off() @ \centering \includegraphics[width=6.6in]{tgp2-sens-mains} \caption{Friedman function main effects, with posterior 90\% intervals.} \label{mains} \end{figure} or just the sensitivity indices (Figure \ref{indices}). \begin{figure}[ht!] <>= plot(sf, layout="sens", maineff=FALSE) @ <>= graphics.off() @ \centering \includegraphics[width=6.5in,trim=0 15 0 15]{tgp2-sens-indices} \caption{Sensitivity indices for the Friedman function.} \label{indices} \end{figure} Note that the posterior intervals shown in these plots represent uncertainty about both the function response and the integration estimates; this full quantification of uncertainty is not presently available in any alternative SA procedures. These plots may be compared to what we know about the Friedman function (refer to Eq.~(\ref{eq:f1})) to evaluate the analysis. The main effects correspond to what we would expect: sine waves for $x_1$ and $x_2$, a parabola for $x_3$, and linear effects for $x_4$ and $x_5$. The sensitivity indices show $x_1$ and $x_2$ contributing roughly equivalent amounts of variation, while $x_4$ is relatively more influential than $x_5$. Full effect sensitivity indices for $x_3$, $x_4$, and $x_5$ are roughly the same as the first order indices, but (due to the interaction in the Friedman function) the sensitivity indices for the total effect of $x_1$ and $x_2$ are significantly larger than the corresponding first order indices. Finally, our SA is able to determine that $x_6$ is unrelated to the response. This analysis assumes the default uncertainty distribution, which is uniform over the range of input data. In other scenarios, it is useful to specify an informative $u(\mb{x})$. In the {\tt sens} function, properties of $u$ are defined through the {\tt rect}, {\tt shape}, and {\tt mode} arguments. 
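While the {\tt lhs} function handles this internally, the idea of a scaled beta density specified by a mode and a shape can be illustrated directly. The parameterization below, $\alpha = 1 + s\,t$ and $\beta = 1 + s(1-t)$ for scaled mode $t$ and shape $s$, is one natural choice that puts the mode at $t$; it is an illustration only and not necessarily the parameterization used inside {\tt tgp}.

<<eval=FALSE>>=
## density of a beta scaled to [a,b] with a given mode; a
## hypothetical sketch, not the tgp internal parameterization
dsbeta <- function(x, a, b, mode, s=2)
  {
    t <- (mode - a)/(b - a)              ## mode on the unit interval
    alpha <- 1 + s*t                     ## (alpha-1)/(alpha+beta-2) = t
    beta <- 1 + s*(1-t)
    dbeta((x - a)/(b - a), alpha, beta)/(b - a)
  }
@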
To guarantee integrability of our indices, we have restricted ourselves to bounded uncertainty distributions. Hence, {\tt rect} defines these bounds. In particular, this defines the domain from which the LHSs are to be taken. We then use independent scaled beta distributions, parameterized by the {\tt shape} parameter and distribution {\tt mode}, to define an informative uncertainty distribution over this domain. As an example of sensitivity analysis under an informative uncertainty distribution, consider the {\tt airquality} data available with the base distribution of {\sf R}. This data set contains daily readings of mean ozone in parts per billion ({\it Ozone}), solar radiation ({\it Solar.R}), wind speed ({\it Wind}), and maximum temperature ({\it Temp}) for New York City, between May 1 and September 30, 1973. Suppose that we are interested in the sensitivity of air quality to natural changes in {\it Solar.R}, {\it Wind}, and {\it Temp}. For convenience, we will build our uncertainty distribution while assuming independence between these inputs. Hence, for each variable, the input uncertainty distribution will be a scaled beta with {\tt shape=2} and {\tt mode} equal to the data mean. <<>>= X <- airquality[,2:4] Z <- airquality$Ozone rect <- t(apply(X, 2, range, na.rm=TRUE)) mode <- apply(X , 2, mean, na.rm=TRUE) shape <- rep(2,3) @ LHS samples from the uncertainty distribution are shown in Figure (\ref{udraw}). \begin{figure}[ht!] <>= Udraw <- lhs(300, rect=rect, mode=mode, shape=shape) par(mfrow=c(1,3), mar=c(4,2,4,2)) for(i in 1:3){ hist(Udraw[,i], breaks=10,xlab=names(X)[i], main="",ylab="", border=grey(.9), col=8) } @ <>= graphics.off() @ \centering \includegraphics[width=6in,trim=0 0 0 30]{tgp2-sens-udraw} \caption{A sample from the marginal uncertainty distribution for the airquality data.} \label{udraw} \end{figure} Due to missing data (discarded in the current version of {\tt tgp}), we suppress warnings for the sensitivity analysis.
We shall use the default {\tt model=btgp}. <<>>= s.air <- suppressWarnings(sens(X=X, Z=Z, nn.lhs=300, rect=rect, shape=shape, mode=mode, verb=0)) @ Figure (\ref{air1}) shows the results from this analysis. \begin{figure}[ht!] <>= plot(s.air, layout="sens") @ <>= graphics.off() @ \centering \includegraphics[width=6.5in,trim=0 15 0 15]{tgp2-sens-air1} \caption{Sensitivity of NYC air quality to natural variation in wind, sun, and temperature.} \label{air1} \end{figure} Through use of {\tt predict.tgp}, it is possible to quickly re-analyze with respect to a new uncertainty distribution without running a new MCMC. We can, for example, look at sensitivity for air quality on only low--wind days. We thus alter the uncertainty distribution (assuming that things remain the same for the other variables) <<>>= rect[2,] <- c(0,5) mode[2] <- 2 shape[2] <- 2 @ and build a set of parameters {\tt sens.p} with the {\tt sens} function by setting {\tt model=NULL}. <<>>= sens.p <- suppressWarnings(sens(X=X,Z=Z,nn.lhs=300, model=NULL, rect=rect, shape=shape, mode=mode)) @ \begin{figure}[ht!] <>= s.air2 <- predict(s.air, BTE=c(1,1000,1), sens.p=sens.p, verb=0) plot(s.air2, layout="sens") @ <>= graphics.off() @ \includegraphics[width=6.5in,trim=0 15 0 15]{tgp2-sens-air2} \caption{Air quality sensitivity on low-wind days.} \label{air2} \end{figure} Figures (\ref{air1}) and (\ref{air2}) both show total effect indices which are much larger than the respective first order sensitivities. As one would expect, the effect on air quality is manifested largely through interactions between variables. Finally, it is also possible to perform SA with binary covariates, included in the regression model as described in Section \ref{sec:cat}. In this case, the uncertainty distribution is naturally characterized by a Bernoulli density.
Setting {\tt shape[i]=0} informs {\tt sens} that the relevant variable is binary (perhaps encoding a categorical input as in Section \ref{sec:cat}), and that the Bernoulli uncertainty distribution should be used. In this case, the {\tt mode[i]} parameter dictates the probability parameter for the Bernoulli, and we must have {\tt rect[i,] = c(0,1)}. As an example, we re-analyze the original air quality data with temperature included as an indicator variable (set to one if the temperature is greater than 79, the median, and zero otherwise). <<>>= X$Temp <- as.numeric(X$Temp > 79) rect <- t(apply(X, 2, range, na.rm=TRUE)) mode <- apply(X , 2, mean, na.rm=TRUE) shape <- c(2,2,0) s.air <- suppressWarnings(sens(X=X, Z=Z, nn.lhs=300, rect=rect, shape=shape, mode=mode, verb=0, basemax=2)) @ \begin{figure}[ht!] <>= plot(s.air, layout="sens") @ <>= graphics.off() @ \centering \includegraphics[width=6.5in,trim=0 15 0 15]{tgp2-sens-air3} \caption{Sensitivity of NYC air quality to natural variation in wind, sun, and a binary temperature variable (for a threshold of 79 degrees).} \label{air3} \end{figure} Figure (\ref{air3}) shows the results from this analysis. \section{Statistical search for optimization} \label{sec:optim} <>= seed <- 0; set.seed(seed) @ There has been considerable recent interest in the use of statistically generated search patterns (i.e., locations of relatively likely optima) for optimization. A popular approach is to estimate a statistical (surrogate) model, and use it to design a set of well-chosen candidates for further evaluation by a direct optimization routine. Such statistically designed search patterns can be used either to direct the optimization completely (e.g., \cite{JoneSchoWelc1998} or \cite{RommShoe2007}) or to work in hybrid with local pattern search optimization (as in \cite{TaddLeeGrayGrif2009}).
A bonus feature of the statistical surrogate approach is that it may be used to tackle problems of optimization under uncertainty, wherein the function being optimized is observed with noise. In this case the search is for input configurations which optimize the response with high probability. Direct-search methods would not apply in this scenario without modification. However, a sensible hybrid could involve inverting the relationship between the two approaches so that direct-search is used on deterministic predictive surfaces from the statistical surrogate model. This search can be used to find promising candidates to complement space-filling ones at which some statistical improvement criterion is evaluated. Towards situating {\tt tgp} as a promising statistical surrogate model for optimization (in both contexts), the approach developed by Taddy et al.~\cite{TaddLeeGrayGrif2009} has been implemented to produce a list of input locations that is ordered by a measure of the potential for new optima. The procedure uses samples from the posterior predictive distribution of treed GP regression models to estimate improvement statistics and build an ordered list of search locations which maximize expected improvement. The single location improvement is defined as $I(\mb{x}) = \mathrm{max}\{f_{min}-f(\mb{x}),0\}$, where $f_{min}$ is the minimum evaluated response in the search (refer to \cite{SchoWelcJone1998} for extensive discussion on general improvement statistics, and to the initial vignette~\cite{gramacy:2007} for details of a base implementation in {\tt tgp}). Thus, a high improvement corresponds to an input location whose response is expected to be much lower than the current minimum. The criterion is easily changed to a search for maximum values through negation of the response. The improvement is always non-negative, as points which do not turn out to be new minimum points still provide valuable information about the output surface.
Thus, in the expectation, candidate locations will be rewarded for high response uncertainty (indicating a poorly explored region of the input space), as well as for low mean predicted response. Our {\tt tgp} generated search pattern will consist of $m$ locations that recursively maximize (over a discrete candidate set) a sequential version of the expected multi-location improvement developed by Schonlau et al.~\cite{SchoWelcJone1998}, defined as $\mbb{E}\left[I^g(\mb{x}_1, \ldots, \mb{x}_m)\right]$ where \begin{equation} \label{eqn:imult} I^g(\mb{x}_1, \ldots, \mb{x}_m) = \left(\mathrm{max}\{(f_{min}-f(\mb{x}_1)), \ldots, (f_{min}-f(\mb{x}_m)), 0 \}\right)^g. \end{equation} Increasing $g \in \{0,1,2,3,\ldots\}$ increases the global scope of the criterion by rewarding, in expectation, extra variability at $\mb{x}$. For example, $g=0$ leads to $\mbb{E}[I^0(\mb{x})] = \Pr(I(\mb{x})>0)$ (assuming the convention $0^0=0$), $g=1$ yields the standard statistic, and $g=2$ explicitly rewards the improvement variance since $\mbb{E}[I^2(\mb{x})] = \mr{var}[I(\mb{x})] + \mbb{E}[I(\mb{x})]^2$. For further discussion on the role of $g$, see \cite{SchoWelcJone1998}. Finding the maximum expectation of (\ref{eqn:imult}) is practically impossible for the full posterior distribution of $I^g(\mb{x}_1, \ldots, \mb{x}_m)$, and would require conditioning on a single fit for the model parameters (for example, static imputation of predictive GP means can be used to recursively build the improvement set \cite{GinsLe-RCarr2009}). However, {\tt tgp} just seeks to maximize over a discrete list of predictive locations. In fact, the default is to return an ordering for the entire {\tt XX} matrix, thus defining a ranking of predictive locations by order of decreasing expected improvement.
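As a small illustration of how these expectations are approximated, given $T$ posterior draws {\tt fx} of $f(\mb{x})$ at a single candidate location, the statistics above may be estimated by simple Monte Carlo averages. The function names below are hypothetical, not part of the {\tt tgp} API.

<<eval=FALSE>>=
imp <- function(fx, fmin) pmax(fmin - fx, 0)   ## draws of I(x)
EI <- function(fx, fmin, g=1) mean(imp(fx, fmin)^g)
## for g=0 use the convention 0^0 = 0, i.e., Pr(I(x) > 0), rather
## than mean(imp(fx, fmin)^0), which would always equal 1 in R
PrI <- function(fx, fmin) mean(imp(fx, fmin) > 0)
@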
There is no restriction on the form for {\tt XX}.\footnote{A full optimization routine would require that the search pattern is placed within an algorithm iterating towards convergence, as in \cite{TaddLeeGrayGrif2009}. However, we concentrate here on the statistical problem of choosing the next samples optimally. We shall touch on issues of convergence in Section \ref{sec:optimskel} where we describe a skeleton scheme for optimization extending {\sf R}'s internal {\tt optim} functionality.} The structure of this scheme will dictate the form for {\tt XX}. If it is the case that we seek simply to explore the input space and map a list of potential locations for improvement, using LHS to choose {\tt XX} will suffice. The discretization of decision space allows for a fast iterative solution to the optimization of $\mbb{E}\left[I^g(\mb{x}_1, \ldots, \mb{x}_m)\right]$. This begins with evaluation of the simple improvement $I^g(\tilde{\mb{x}}_i)$ over $\tilde{\mb{x}}_i \in {\bf \tilde X}$ at each of $T=$ {\tt BTE[2]-BTE[1]} MCMC iterations (each corresponding to a single posterior realization of {\tt tgp} parameters and predicted response after burn-in) to obtain the posterior sample \begin{equation*} \mathcal{I} = \left\{ \begin{array}{rcl} I^g( \tilde{\mb{x}}_1)_1& \ldots& I^g(\tilde{\mb{x}}_m)_1\\ &\vdots& \\ I^g( \tilde{\mb{x}}_1)_T& \ldots& I^g(\tilde{\mb{x}}_m)_T \end{array}\right\}. \end{equation*} Recall that in {\tt tgp} parlance, and as input to the {\tt b*} functions: $\tilde{\mb{X}}\equiv $ {\tt XX}. 
We then proceed iteratively to build an {\it ordered} collection of $m$ locations according to an iteratively refined improvement: Designate $\mb{x}_1 = \mathrm{argmax}_{\tilde{\mb{x}} \in {\bf \tilde X}} \mbb{E}\left[I^g( \tilde{\mb{x}})\right]$, and for $j=2,\ldots,m$, given that $\mb{x}_1, \ldots, \mb{x}_{j-1}$ are already included in the collection, the next member is \begin{eqnarray*} \mb{x}_j &=& \mathrm{argmax}_{\tilde{\mb{x}} \in {\bf \tilde X}} \mbb{E}\left[ \mathrm{max}\{I^g( \mb{x}_1, \ldots, \mb{x}_{j-1}), I^g(\tilde{\mb{x}}) \} \right]\\ &=& \mathrm{argmax}_{\tilde{\mb{x}} \in {\bf \tilde X}} \mbb{E}[\left(\mathrm{max}\{(f_{min}-f(\mb{x}_1)), \ldots, (f_{min}-f(\mb{x}_{j-1})), (f_{min}-f(\tilde{\mb{x}})), 0\}\right)^g ] \\ &=& \mathrm{argmax}_{\tilde{\mb{x}} \in {\bf \tilde X}} \mbb{E}\left[I^g(\mb{x}_1, \ldots, \mb{x}_{j-1},\tilde{\mb{x}})\right]. \end{eqnarray*} Thus, after each $j^{\rm th}$ additional point is added to the set, we have the maximum expected $j$--location improvement conditional on the first $j-1$ locations. This is not necessarily the unconditionally maximal expected $j$--location improvement; instead, point $\mb{x}_j$ is the location which will cause the greatest increase in expected improvement over the given $(j-1)$--location expected improvement. The posterior sample $\mathcal{I}$ acts as a discrete approximation to the true posterior distribution for improvement at locations within the candidate set {\tt XX}. Based upon this approximation, iterative selection of the point set is possible without any re-fitting of the {\tt tgp} model. 
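The iterative selection from $\mathcal{I}$ can be sketched directly in {\sf R}. Here {\tt I} is the $T \times m$ matrix of posterior improvement draws over the candidate set; the function and its arguments are hypothetical, as {\tt tgp} performs this ordering internally in compiled code.

<<eval=FALSE>>=
## order candidate columns of I by iterative expected improvement
improv.order <- function(I, nsel=ncol(I))
  {
    sel <- integer(0)
    best <- rep(0, nrow(I))  ## I^g draws for the current collection
    for(l in 1:nsel) {
      Il <- pmax(I, best)    ## draws of the l-location improvement
      avg <- colMeans(Il)
      avg[sel] <- -Inf       ## already-selected points add nothing
      j <- which.max(avg)    ## column with the maximum average
      sel <- c(sel, j)
      best <- Il[,j]
    }
    sel
  }
@

Note that {\tt pmax(I, best)} recycles {\tt best} down each column, so every draw of each candidate's improvement is compared against the corresponding draw for the current collection.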
Conditional on the inclusion of $\tilde{\mb{x}}_{i_1},\ldots,\tilde{\mb{x}}_{i_{l-1}}$ in the collection, a posterior sample of the $l$--location improvement statistics is calculated as \begin{equation*} \mathcal{I}_l = \left\{ \begin{array}{rcl} I^g( \tilde{\mb{x}}_{i_1},\ldots,\tilde{\mb{x}}_{i_{l-1}}, \tilde{\mb{x}}_1)_1 & \ldots& I^g( \tilde{\mb{x}}_{i_1},\ldots,\tilde{\mb{x}}_{i_{l-1}}, \tilde{\mb{x}}_m)_1\\ &\vdots& \\ I^g(\tilde{\mb{x}}_{i_1},\ldots,\tilde{\mb{x}}_{i_{l-1}}, \tilde{\mb{x}}_1)_T& \ldots& I^g(\tilde{\mb{x}}_{i_1},\ldots,\tilde{\mb{x}}_{i_{l-1}},\tilde{\mb{x}}_m)_T \end{array}\right\}, \end{equation*} where the element in the $t^{\rm th}$ row and $j^{\rm th}$ column of this matrix is calculated as $\max\{I^g(\tilde{\mb{x}}_{i_1}, \ldots, \tilde{\mb{x}}_{i_{l-1}})_t, I^g(\tilde{\mb{x}}_j)_t\}$, and the $l^{\rm th}$ location included in the collection corresponds to the column of this matrix with maximum average. Since the multi-location improvement is always at least as high as the improvement at any subset of those locations, the same points will not be chosen twice for inclusion. In practice, very few iterations (about 10\% of the total candidate size under the default inference and regression model(s)) through this ordering process can be performed before the iteratively updated improvement statistics become essentially zero. Increasing the number of MCMC iterations ({\tt BTE[2]-BTE[1]}) can mitigate this to a large extent.\footnote{Once a zero (maximal) iterative improvement is attained the rest of the ranking is essentially arbitrary, at which point {\tt tgp} cuts off the process prematurely.} We refer the reader to \cite{TaddLeeGrayGrif2009} for further details on this approach to multi-location improvement search. \subsection{A simple example} We shall use the Rosenbrock function to illustrate the production of an ordered collection of (possible) adaptive samples to maximize the expected improvement within {\tt tgp}.
Specifically, the two dimensional Rosenbrock function is defined as <<>>= rosenbrock <- function(x){ x <- matrix(x, ncol=2) 100*(x[,1]^2 - x[,2])^2 + (x[,1] - 1)^2 } @ and we shall bound the search space for adaptive samples to the rectangle: $-1\le x_i \le 5$ for $i=1,2$. The single global minimum of the Rosenbrock function is at $(1,1)$. <<>>= rosenbrock(c(1,1)) @ This function involves a long steep valley with a gradually sloping floor, and is considered to be a difficult problem for local optimization routines. We begin by drawing an LHS of 40 input locations within the bounding rectangle, and evaluating the function at these locations. <<>>= rect <- cbind(c(-1,-1),c(5,5)) X <- lhs(40, rect) Z <- rosenbrock(X) @ We will fit a {\tt bgp} model to this data to predict the Rosenbrock response at unobserved (candidate) input locations in {\tt XX}. The {\tt improv} argument may be used to obtain an ordered list of places where we should be looking for new minima. In particular, specifying {\tt improv=c(1,10)} will return the 10 locations which maximize the iterative multi-location expected improvement function, with $g=1$ (i.e., Eq.~(\ref{eqn:imult})). Note that {\tt improv=TRUE} is also possible, in which case {\tt g} defaults to one and the entire list of locations is ranked. Our candidate set is just a space filling LHS design. In other situations, it may be useful to build an informative LHS design (i.e., to specify {\tt shape} and {\tt mode} arguments for the {\tt lhs} function) to reflect what is already known about the location of optima. <<>>= XX <- lhs(200, rect) rfit <- bgp(X,Z,XX,improv=c(1,10), verb=0) @ Upon return, the \verb!"tgp"!-class object {\tt rfit} includes the matrix {\tt improv}, which is a list of the expected single location improvement for the 200 {\tt XX} locations, and the top 10 ranks. Note that the {\tt rank}s for those points which are not included in the top 10 are set to {\tt nrow(XX)=}\Sexpr{nrow(XX)}. 
Here are the top 10: <<>>= cbind(rfit$improv,XX)[rfit$improv$rank <= 10,] @ This iterative algorithm may produce ranks that differ significantly from a straightforward ordering of expected improvement. This leads to a list that better explores the input space, since the expected improvement is naturally balanced against a desire to search the domain. We plot the results with the usual function, by setting {\tt as="improv"}, in Figure \ref{optim-fit1}. \begin{figure}[htb!] <>= plot(rfit, as="improv") @ <>= graphics.off() @ \centering \includegraphics[width=6.5in,trim=0 25 0 25]{tgp2-optim-fit1} \caption{The {\em left} panel shows the mean predicted Rosenbrock function response, and on the {\em right} we have expected single location improvement with the top 10 points (labelled by rank) plotted on top.} \label{optim-fit1} \end{figure} The banana--shaped region of higher expected improvement corresponds to the true valley floor for the Rosenbrock function, indicating that the {\tt bgp} model is doing a good job of prediction. Also, we note that the ordered input points are well dispersed throughout the valley---a very desirable property for adaptive sampling candidates. It is straightforward, with {\tt predict.tgp}, to obtain a new ordering for the more global {\tt g=5} (or any new {\tt g}). Figure \ref{optim-fit2} shows a more diffuse expected improvement surface and a substantially different point ordering. In practice, we have found that {\tt g=2} provides a good compromise between local and global search. \begin{figure}[htb!]
<>= rfit2 <- predict(rfit, XX=XX, BTE=c(1,1000,1), improv=c(5,20), verb=0) plot(rfit2, layout="as", as="improv") @ <>= graphics.off() @ \centering \includegraphics[width=3.25in,trim=0 25 0 25]{tgp2-optim-fit2} \caption{The expected improvement surface and top 20 ordered locations, for {\tt g=5}.} \label{optim-fit2} \end{figure} \subsection{A skeleton optimization scheme} \label{sec:optimskel} The capabilities outlined above are useful in their own right, as a search list or candidate set ranked by expected improvement provides concrete information about potential optima. However, a full optimization framework requires that the production of these sets of search locations be nested within an iterative search scheme. The approach taken by Taddy et al.~\cite{TaddLeeGrayGrif2009} achieves this by taking the {\tt tgp} generated sets of locations and using them to augment a local optimization search algorithm. In this way, the authors are able to achieve robust solutions which balance the convergence properties of the local methods with the global scope provided by {\tt tgp}. Indeed, any optimization routine capable of evaluating points provided by an outside source could benefit from a {\tt tgp} generated list of search locations. In the absence of this sort of formal hybrid search algorithm, it is still possible to devise robust optimization algorithms based around {\tt tgp}.
A basic algorithm is as follows: first, use an LHS to explore the input space (see the {\tt lhs} function included in {\tt tgp}). Repeatedly fit one of the {\tt b*} models with {\tt improv!=FALSE} to the evaluated iterates to produce a search set, as described earlier, and then evaluate the objective function over the highest ranked locations in that search set. Continue until you are confident that the search has narrowed to a neighborhood around the true optimum (a good indicator of this is when all of the top-ranked points are in the same area). At this point, the optimization may be completed by {\tt optim}, {\sf R}'s general purpose local optimization algorithm, in order to guarantee convergence. The {\tt optim} routine may be initialized to the best input location (i.e., the one corresponding to the best function evaluation) found thus far by {\tt tgp}. Note that this approach is actually an extreme version of a template proposed by Taddy et al.~\cite{TaddLeeGrayGrif2009}, where the influence of global (i.e., {\tt tgp}) search is downweighted over time rather than cut off. In either case, a drawback to such approaches is that they do not apply when the function being optimized is not deterministic. An alternative scheme is to employ both {\tt tgp} search and a local optimization at each iteration. The idea is that a mix of local and global information is provided throughout the entire optimization, but with an added twist. Rather than apply {\tt optim} on the stochastic function directly, which would not converge due to the noise, it can be applied on a deterministic (MAP) kriging surface provided by {\tt tgp}. The local optima obtained can be used to augment the candidate set of locations where the improvement statistic is gathered---which would otherwise be a simple LHS. That way the search pattern produced on output is likely to have a candidate with high improvement.
To fix ideas, and for the sake of demonstration, the {\tt tgp} package includes a skeleton function for performing a single iteration in the derivative--free optimization of noisy black--box functions.  The function is called {\tt optim.step.tgp}, and the name is intended to emphasize that it performs a single step in an optimization by trading off local {\tt optim}--based search of {\tt tgp} predictive (kriging surrogate) surfaces with the expected posterior improvement.  In other words, it is loosely based on some of the techniques alluded to above, but is designed to be augmented/adjusted as needed.  Given $N$ pairs of inputs and responses $(\mb{X}, \mb{Z})$, {\tt optim.step.tgp} suggests new points at which the function being optimized should be evaluated.  It also returns information that can be used to assess convergence.  An outline follows.

The {\tt optim.step.tgp} function begins by constructing a set of candidate locations, either as a space--filling LHS over the input space (the default) or from a treed $D$--optimal design, based on a previously obtained \verb!"tgp"!-class model.  {\sf R}'s {\tt optim} command is used on the MAP predictive surface contained within the object to obtain an estimate of the current best guess $\mb{x}$-location of the optimal solution.  A standalone subroutine called {\tt optim.ptgpf} is provided for this specific task, to be used within {\tt optim.step.tgp} or otherwise.  Within {\tt optim.step.tgp}, {\tt optim.ptgpf} is initialized with the data location currently predicted to be the best guess of the minimum.  The optimal $\mb{x}$-location found is then added into the set of candidates, as it is likely that the expected improvement would be high there.  Then, a new \verb!"tgp"!-class object is obtained by applying a {\tt b*} function to $(\mb{X}, \mb{Z})$ whilst sampling from the posterior distribution of the improvement statistic.
The best one, two, or several locations with the highest improvement ranks are suggested for addition into the design.  The values of the maximum improvement statistic are also returned in order to track progress in future iterations.  The \verb!"tgp"!-class object returned is used to construct candidates and initialize the {\tt optim.ptgpf} function in future rounds.

To illustrate, consider the 2-d exponential data from the initial vignette \cite{gramacy:2007} as our noisy function $f$.
<<>>=
f <- function(x) { exp2d.Z(x)$Z }
@ 
Recall that these data are characterized by a mean value of
\[
f(\mb{x}) = x_1 \exp(-x_1^2 - x_2^2),
\]
which is observed with a small amount of Gaussian noise (with sd $=0.001$).  Elementary calculus gives that the minimum of $f$ is obtained at $\mb{x} = (-\sqrt{1/2},0)$.

The {\tt optim.step.tgp} function requires that the search domain be defined by a bounding rectangle, and we require an initial design to start things off.  Here we shall use $[-2,6]^2$ with an LHS design therein.
<<>>=
rect <- rbind(c(-2,6), c(-2,6))
X <- lhs(20, rect)
Z <- f(X)
@ 
The following code proceeds with several rounds of sequential design towards finding the minimum of {\tt f}.
<>=
out <- progress <- NULL
for(i in 1:20) {

  ## get recommendations for the next point to sample
  out <- optim.step.tgp(f, X=X, Z=Z, rect=rect, prev=out, verb=0)

  ## add in the inputs, and newly sampled outputs
  X <- rbind(X, out$X)
  Z <- c(Z, f(out$X))

  ## keep track of progress and best optimum
  progress <- rbind(progress, out$progress)
}
@ 
Progress can be tracked through the rows of the {\tt data.frame} constructed above, which contains, for each round, the input location of the current best guess of the minimum, the value of the objective at that point, and the maximum of the improvement statistic.
\begin{figure}[ht!]
\centering
<>=
par(mfrow=c(1,2))
matplot(progress[,1:2], main="x progress",
        xlab="rounds", ylab="x[,1:2]", type="l", lwd=2)
legend("topright", c("x1", "x2"), lwd=2, col=1:2, lty=1:2)
plot(log(progress$improv), type="l", main="max log improv",
     xlab="rounds", ylab="max log(improv)")
@ 
<>=
graphics.off()
@ 
\includegraphics[trim=40 20 0 0]{tgp2-optim-progress}
%\vspace{-0.5cm}
\caption{Progress in iterations of {\tt optim.step.tgp} shown by tracking the $\mb{x}$--locations of the best guess of the minimum ({\em left}) and the logarithm of the maximum of the improvement statistics at the candidate locations ({\em right})}
\label{f:optim:progress}
\end{figure}
In addition to printing this data to the screen, plots such as the ones in Figure \ref{f:optim:progress} can be valuable for assessing convergence.  As can be seen in the figure, the final iteration gives an $\mb{x}$-value that is very close to the correct result, and is (in some loose sense) close to convergence.
<<>>=
out$progress[1:2]
@ 
As mentioned above, if it is known that the function evaluations are deterministic then, at any time, {\sf R}'s {\tt optim} routine can be invoked---perhaps initialized by the $\bm{x}$-location in \verb!out$progress!---and convergence to a local optimum thus guaranteed.  Otherwise, the quantities in \verb!out$progress! will converge, in some sense, as long as the number of MCMC rounds used in each round, above, ($T=$ {\tt BTE[2]-BTE[1]}) tends to infinity.  Such arguments to the {\tt b*} functions can be set via the ellipses ({\tt ...}) arguments to {\tt optim.step.tgp}.\footnote{This runs contrary to how the ellipses are used by {\tt optim} in order to specify static arguments to {\tt f}.
If setting static arguments to {\tt f} is required within {\tt optim.step.tgp}, then they must be set in advance by adjusting the default arguments via {\tt formals}.}  A heuristic stopping criterion can be based on the maximum improvement statistic obtained in each round, as long as the candidate locations become dense in the region as $T\rightarrow \infty$.  This can be adjusted by increasing the {\tt NN} argument to {\tt optim.step.tgp}.

The internal use of {\tt optim} within {\tt optim.step.tgp} on the posterior predictive (kriging surrogate) surface via {\tt optim.ptgpf} may proceed with any of the usual method arguments.  I.e.,
<<>>=
formals(optim)$method
@ 
however the default ordering is switched in {\tt optim.ptgpf}, which also includes one extra method.
<<>>=
formals(optim.ptgpf)$method
@ 
Placing \verb!"L-BFGS-B"! in the default position is sensible since this method enforces a rectangle of constraints as specified by {\tt rect}.  This guarantees that the additional candidate found by {\tt optim.ptgpf} will be valid.  However, the other {\tt optim} methods generally work well even though they do not enforce this constraint.  The final method, \verb!"optimize"!, applies only when the inputs to {\tt f} are 1-d.  In this case, the documentation for {\tt optim} suggests using the {\tt optimize} function instead.

\section{Importance tempering}
\label{sec:it}
<>=
seed <- 0; set.seed(seed)
@ 
It is well--known that MCMC inference in Bayesian treed methods suffers from poor mixing.  For example, Chipman et al.~\cite{chip:geor:mccu:1998,chip:geor:mccu:2002} recommend periodically restarting the MCMC to avoid chains becoming stuck in local modes of the posterior distribution (particularly in tree space).  The treed GP models are no exception, although it is worth remarking that using flexible GP models at the leaves of the tree typically results in shallower trees, and thus less pathological mixing in tree space.
Version 1.x provided some crude tools to help mitigate the effects of poor mixing in tree space.  For example, the {\tt R} argument to the {\tt b*} functions facilitates the restarts suggested by Chipman et al.

A modern Monte Carlo technique for dealing with poor mixing in Markov chain methods is to employ {\em tempering} to flatten the peaks and raise the troughs in the posterior distribution so that movement between modes is more fluid.  One such method, called {\em simulated tempering} (ST) \cite{geyer:1995}, is essentially the MCMC analogue of the popular simulated annealing algorithm for optimization.  The ST algorithm helps obtain samples from a multimodal density $\pi(\theta)$ where standard methods, such as Metropolis--Hastings (MH) \cite{met:1953,hast:1970} and Gibbs Sampling (GS) \cite{geman:1984}, fail.  As will be shown in our examples, ST can guard against becoming stuck in local modes of the {\tt tgp} posterior by encouraging better mixing {\em between modes} via an increase in the acceptance rate of tree modification proposals, particularly {\em prunes}.  However, as we will see, ST suffers from inefficiency because it discards the lion's share of the samples it collects.  The discarded samples can be recycled if they are given appropriate importance sampling (IS) \cite{liu:2001} weights.  These weights, if combined carefully, can be used to construct meta-estimators of expectations under the {\tt tgp} posterior that have much lower variance compared to ST alone.  This combined application of ST and IS is dubbed {\em importance tempering} (IT) \cite{gra:samw:king:2009}.

\subsection{Simulated Tempering and related methods}
\label{sec:st}

ST is an application of the MH algorithm on the product space of parameters and inverse temperatures $k\in [0,1]$.  That is, ST uses MH to sample from the joint chain $\pi(\theta,k) \propto \pi(\theta)^k p(k)$.
The inverse temperature is allowed to take on a discrete set of values $k \in \{k_1,\dots,k_m: k_1 = 1, \; k_i > k_{i+1} \geq 0\}$, called the {\em temperature ladder}.  Typically, ST calls for sampling $(\theta,k)^{(t+1)}$ by first updating $\theta^{(t+1)}$ conditional on $k^{(t)}$ and (possibly) on $\theta^{(t)}$, using MH or GS.  Then, for a proposed $k' \sim q(k^{(t)} \rightarrow k')$, usually giving equal probability to the nearest inverse temperatures greater and less than $k^{(t)}$, an acceptance ratio is calculated:
\[
A^{(t+1)} = \frac{\pi(\theta^{(t+1)})^{k'} p(k') q(k' \rightarrow k^{(t)})}{\pi(\theta^{(t+1)})^{k^{(t)}} p(k^{(t)}) q(k^{(t)}\rightarrow k')}.
\]
Finally, $k^{(t+1)}$ is determined according to the MH accept/reject rule: set $k^{(t+1)} = k'$ with probability $\alpha^{(t+1)} = \min\{1,A^{(t+1)}\}$, or $k^{(t+1)} = k^{(t)}$ otherwise.  Standard theory for MH and GS gives that samples from the marginals $\pi_{k_i}$ can be obtained by collecting samples $\theta^{(t)}$ where $k^{(t)} = k_i$.  Samples from $\pi(\theta)$ are obtained when $k^{(t)} = 1$.

The success of ST depends crucially on the ability of the Markov chain frequently to: (a) visit high temperatures (low $k$), where the probability of escaping local modes is increased; and (b) visit $k=1$ to obtain samples from $\pi$.  The algorithm can be tuned by: (i.)~adjusting the number and location of the rungs of the temperature ladder; or (ii.)~setting the pseudo-prior $p(k)$ for the inverse temperature.  Geyer \& Thompson \cite{geyer:1995} give ways of adjusting the spacing of the rungs of the ladder so that the ST algorithm achieves between--temperature acceptance rates of 20--40\%.  More recently, authors have preferred to rely on defaults, e.g.,
\begin{equation}
\;\;\;\;\; k_i = \left\{ \begin{array}{cl}
(1+\Delta_k)^{1-i} & \mbox{geometric spacing}\\
\{1+\Delta_k (i-1)\}^{-1} & \mbox{harmonic spacing}
\end{array} \right. \;\;\;\;\; i=1,\dots,m.
\label{eq:ladder}
\end{equation}
Motivation for such default spacings is outlined by Liu \cite{liu:2001}.  Geometric spacing, or uniform spacing of $\log(k_i)$, is also advocated by Neal \cite{neal:1996,neal:2001} to encourage the Markov chain to rapidly traverse the breadth of the temperature ladder.  Harmonic spacing is more often used by a related method called Metropolis coupled Markov chain Monte Carlo (MC$^3$) \cite{geyer:1991}.  Both defaults are implemented in the {\tt tgp} package, through the provided {\tt default.itemps} function.  A new ``sigmoidal'' option is also implemented, as discussed below.  The rate parameter $\Delta_k>0$ can be problem specific.  Rather than work with $\Delta_k$, the {\tt default.itemps} function allows the ladder to be specified via $m$ and the hottest temperature $k_m$, thus fixing $\Delta_k$ implicitly.  I.e., for the geometric ladder $\Delta_k = (k_m)^{1/(1-m)}-1$, and for the harmonic ladder $\Delta_k = \frac{(k_m)^{-1}-1}{m-1}$.

A sigmoidal ladder can provide a higher concentration of temperatures near $k=1$ without sacrificing the other nice properties of the geometric and harmonic ladders.  It is specified by first situating $m$ indices $j_i\in \mathbb{R}$ so that $k_1 = k(j_1) = 1$ and $k_m = k(j_m) = k_{\mbox{\tiny m}}$ under
\[
k(j_i) = 1.01 - \frac{1}{1+e^{j_i}}.
\]
The remaining $j_i$, $i=2,\dots,(m-1)$, are spaced evenly between $j_1$ and $j_m$ to fill out the ladder $k_i = k(j_i)$, $i=1,\dots,m$.

By way of comparison, consider generating the three different types of ladder with identical minimum inverse temperature $k_{\mbox{\tiny m}} = 0.1$, the default setting in {\tt tgp}.
<<>>=
geo <- default.itemps(type="geometric")
har <- default.itemps(type="harmonic")
sig <- default.itemps(type="sigmoidal")
@ 
The plots in Figure \ref{f:itemps} show the resulting inverse temperature ladders, and their logarithms.
\begin{figure}[ht!]
<>=
par(mfrow=c(2,1))
all <- cbind(geo$k, har$k, sig$k)
matplot(all, pch=21:23, main="inv-temp ladders",
        xlab="indx", ylab="itemp")
legend("topright", pch=21:23, c("geometric","harmonic","sigmoidal"), col=1:3)
matplot(log(all), pch=21:23, main="log(inv-temp) ladders",
        xlab="indx", ylab="itemp")
@ 
<>=
graphics.off()
@ 
\centering
\includegraphics[height=5.9in,width=4.5in,trim=0 20 0 20]{tgp2-it-itemps}
\caption{Three different inverse temperature ladders, each with $m=40$ temperatures starting at $k_1=1$ and ending at $k_m=0.1$}
\label{f:itemps}
\end{figure}
Observe how, relative to the geometric ladder, the harmonic ladder has a higher concentration of inverse temperatures near zero, whereas the sigmoidal ladder has a higher concentration near one.

Once a suitable ladder has been chosen, the {\tt tgp} package implementation of ST follows the suggestions of Geyer \& Thompson \cite{geyer:1995} in setting the pseudo--prior, starting from a uniform $p_0$.  First, $p_0$ is adjusted by {\em stochastic approximation}: add $c_0/[m(t+n_0)]$ to $\log p_0(k)$ for each $k_i \ne k^{(t)}$ and subtract $c_0/(t+n_0)$ from $\log p_0(k^{(t)})$ over $t=1,\dots,B$ {\em burn--in} MCMC rounds sampling from the joint posterior of $(\theta, k)$.  Then, $p_0$ is normalized to obtain $p_1$.  Before subsequent runs, specified via an {\tt R >= 2} argument, {\em occupation numbers} $o(k_i) = \sum_{t=1}^B 1_{\{k^{(t)} = k_i\}}$ are used to update $p(k_i) \propto p_1(k_i)/o(k_i)$.  Note that, in this setting, the {\tt R} argument is used to update the pseudo--prior only, not to restart the Markov chain.

\subsection{Importance sampling from tempered distributions}
\label{sec:temp}

ST provides us with $\{(\theta^{(t)},k^{(t)}): t = 1,\ldots,T\}$, where $\theta^{(t)}$ is an observation from $\pi_{k^{(t)}}$.  It is convenient to write $\mathcal{T}_i = \{t: k^{(t)} = k_i\}$ for the index set of observations at the $i^{\mbox{\tiny th}}$ temperature, and let $T_i = |\mathcal{T}_i|$.
Let the vector of observations at the $i^{\mbox{\tiny th}}$ temperature be collected in $\bm{\theta}_i = (\theta_{i1},\dots,\theta_{iT_i})$, so that $\{\theta_{ij}\}_{j=1}^{T_i}\sim \pi_{k_i}$.  Each vector $\bm{\theta}_i$ can be used to construct an IS estimator of $E_{\pi}\{h(\theta)\}$ by setting
\[
\hat{h}_i = \frac{\sum_{j=1}^{T_i} w_i(\theta_{ij}) h(\theta_{ij})}
{\sum_{j=1}^{T_i} w_i(\theta_{ij})} \equiv
\frac{\sum_{j=1}^{T_i} w_{ij}h(\theta_{ij})}{W_i},
\]
say.  That is, rather than obtain one estimator from ST (at the cold temperature), we can obtain $m$ estimators (one at each temperature) via IS.  The efficiency of each estimator, $i=1,\dots,m$, can be measured through its variance, but unfortunately this can be difficult to calculate in general.  As a result, the notion of {\em effective sample size} \cite{liu:2001} (ESS) plays an important role in the study of IS estimators.  Denote the vector of IS weights at the $i^{\mbox{\tiny th}}$ temperature as $\mathbf{w}_i = \mathbf{w}_i(\bm{\theta}_i) = (w_i(\theta_{i1}),\ldots,w_i(\theta_{iT_i}))$, where $w_i(\theta) = \pi(\theta)/\pi_{k_i}(\theta)$.  The ESS of a vector of $T$ weights $\mb{w}$ is defined by
\begin{equation}
\mathrm{ESS}(\mb{w}) = \frac{T}{1 + \mathrm{cv^2}(\mathbf{w})},
\label{eq:essw}
\end{equation}
where $\mathrm{cv}(\mathbf{w})$ is the \emph{coefficient of variation} of the weights, given by
\begin{align*}
\mathrm{cv^2}(\mathbf{w}) &= \frac{\sum_{t=1}^T(w(\theta^{(t)}) - \bar{w})^2}{(T-1) \bar{w}^2}, &\mbox{where} && \bar{w} &= T^{-1} \sum_{t=1}^T w(\theta^{(t)}).
\end{align*}
In {\sf R}:
<<>>=
ESS <- function(w)
{
  mw <- mean(w)
  cv2 <- sum((w-mw)^2)/((length(w)-1)*mw^2)
  ess <- length(w)/(1+cv2)
  return(ess)
}
@ 
This should not be confused with the concept of \emph{effective sample size due to autocorrelation} \cite{kass:1998} (due to serially correlated samples coming from a Markov chain, as in MCMC) as implemented by the {\tt effectiveSize} function in the {\tt coda} package \cite{coda:R} for {\sf R}.

Before attempting to combine $m$ IS estimators, it is fruitful to backtrack briefly and gain some perspective on applying IS with a {\em single} tempered proposal distribution.  Jennison \cite{jennison:1993} put this idea forward more than a decade ago, although the question of how to choose the best temperature was neither posed nor resolved.  It is clear that larger $k$ leads to lower variance estimators (and larger ESS), but at the expense of poorer mixing in the Markov chain.  It can be shown that the optimal inverse temperature $k^*$ for IS, in the sense of constructing a minimum variance estimator, may be significantly lower than one \cite{gra:samw:king:2009}.  However, the variance of such an estimator will indeed become unbounded as $k\rightarrow 0$, just as ESS~$\rightarrow 0$.  Needless to say, how best to pick the temperatures (for ST or IS) is still an open problem.  But in the context of the family of tempered distributions used by ST for mixing considerations, this means that the discarded samples obtained when $k^{(t)} < 1$ may actually lead to more efficient estimators than the ones saved from the cold distribution.  So ST is wasteful indeed.  However, when combining IS estimators from the multiple temperatures used in ST, the deleterious effect of the high variance ones obtained at high temperature must be mitigated.  The possible strategies involved in developing such a meta-estimator comprise the {\em importance tempering} (IT) family of methods.
The idea is that a small ESS indicates a high variance IS estimator, which should be relegated to having only a small influence on the overall estimator.

\subsection{An optimal way to combine IS estimators}
\label{sec:lambdas}

It is natural to consider an overall meta-estimator of $E_{\pi}\{h(\theta)\}$ defined by a convex combination:
\begin{align}
\label{eq:hhatlambda}
\hat{h}_{\lambda} &= \sum_{i=1}^m \lambda_i \hat{h}_i,& \mbox{where} && 0 \leq \lambda_i \leq \sum_{i=1}^m \lambda_i = 1.
\end{align}
Unfortunately, if $\lambda_1,\dots,\lambda_m$ are not chosen carefully, $\mbox{Var}(\hat{h}_\lambda)$ can be nearly as large as the largest $\mbox{Var}(\hat{h}_i)$ \cite{owen:2000}, due to the considerations alluded to in Section \ref{sec:temp}.  Notice that ST is recovered as a special case when $\lambda_1=1$ and $\lambda_2,\dots,\lambda_m = 0$.  It may be tempting to choose $\lambda_i = W_i/W$, where $W = \sum_{i=1}^m W_i$.  The resulting estimator is equivalent to
\begin{align}
\label{Eq:hath}
\hat{h} &= W^{-1} \sum_{t=1}^T w(\theta^{(t)},k^{(t)})h(\theta^{(t)}), & \mbox{where} && W = \sum_{t=1}^T w(\theta^{(t)},k^{(t)}),
\end{align}
and $w(\theta,k) = \pi(\theta)/\pi(\theta)^k = \pi(\theta)^{1-k}$.  However, it can lead to a very poor estimator, even compared to ST, as will be demonstrated empirically in the examples to follow shortly.  Observe that we can equivalently write
\begin{align}
\hat{h}_{\lambda} &= \sum_{i=1}^m \sum_{j=1}^{T_i} w_{ij}^{\lambda}h(\theta_{ij}), && \mbox{where} & w_{ij}^{\lambda} &= \lambda_iw_{ij}/W_i.
\label{eq:wlambda}
\end{align}
Let $\mathbf{w}^{\lambda} = (w_{11}^\lambda,\ldots,w_{1T_1}^\lambda,w_{21}^\lambda,\ldots,w_{2T_2}^\lambda, \ldots,w_{m1}^\lambda,\ldots,w_{mT_m}^\lambda)$.  Attempting to choose $\lambda_1,\dots,\lambda_m$ to minimize $\mbox{Var}(\hat{h}_\lambda)$ directly can be difficult.
Moreover, for the applications that we have in mind, it is important that our estimator can be constructed without knowledge of the normalizing constants of $\pi_{k_1},\ldots,\pi_{k_m}$, and without evaluating the MH transition kernels $\mathcal{K}_{\pi_{k_i}}(\cdot,\cdot)$.  It is for this reason that methods like the \emph{balance heuristic} \cite{veach:1995}, MCV \cite{owen:2000}, or population Monte Carlo (PMC) \cite{douc:etal:2007} cannot be applied.  Instead, we seek to maximize the effective sample size of $\hat{h}_\lambda$ in (\ref{eq:hhatlambda}), and look for an $O(T)$ operation to determine the optimal $\lambda^*$.  Among estimators of the form~(\ref{eq:hhatlambda}), it can be shown \cite{gra:samw:king:2009} that $\mathrm{ESS}(\mathbf{w}^{\lambda})$ is maximized by $\lambda = \lambda^*$, where, for $i=1,\ldots,m$,
\begin{align*}
\lambda_i^* &= \frac{\ell_i}{\sum_{i=1}^m \ell_i}, & \mbox{and} && \ell_i &= \frac{W_i^2}{\sum_{j=1}^{T_i} w_{ij}^2}.
\end{align*}
The efficiency of each IS estimator $\hat{h}_i$ can be measured through $\mathrm{ESS}(\mathbf{w}_i)$.  Intuitively, we hope that with a good choice of $\lambda$ the ESS (\ref{eq:essw}) of $\hat{h}_{\lambda}$ would be close to the sum over $i$ of the effective sample sizes of each $\hat{h}_i$.  This is indeed the case for $\hat{h}_{\lambda^*}$, because it can be shown \cite{gra:samw:king:2009} that
\[
\mathrm{ESS}(\mathbf{w}^{\lambda^*}) \geq \sum_{i=1}^m \mathrm{ESS}(\mathbf{w}_i) - \frac{1}{4} - \frac{1}{T}.
\]
In practice we have found that this bound is conservative, and that in fact $\mathrm{ESS}(\mathbf{w}^{\lambda^*}) \geq \sum_{i=1}^m \mathrm{ESS}(\mathbf{w}_i)$, as will be shown empirically in the examples that follow.  Thus our optimally--combined IS estimator has a highly desirable and intuitive property in terms of its effective sample size: the whole is greater than the sum of its parts.
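Since $\lambda^*$ requires only the sums $W_i$ and $\sum_j w_{ij}^2$, it is indeed an $O(T)$ computation.  The following self--contained {\sf R} sketch, with simulated weights standing in for real IS output (an illustrative assumption, not {\tt tgp} output), checks that $\lambda^*$ yields a larger ESS than both the na\"ive choice $\lambda_i = W_i/W$ and a uniform allocation.

```r
## effective sample size of a vector of IS weights, as defined above
ESS <- function(w) {
  mw <- mean(w)
  cv2 <- sum((w - mw)^2)/((length(w) - 1)*mw^2)
  length(w)/(1 + cv2)
}

## simulated IS weights for m=3 temperatures (illustration only)
set.seed(1)
w <- list(rexp(100, 1), rexp(100, 2)^2, rexp(100, 3)^3)

## l_i = W_i^2 / sum_j w_ij^2, and lambda*_i = l_i / sum(l)
l <- sapply(w, function(wi) sum(wi)^2 / sum(wi^2))
lambda.star <- l / sum(l)

## combined weight vector with entries w_ij^lambda = lambda_i * w_ij / W_i
wlambda <- function(lambda)
  unlist(mapply(function(li, wi) li * wi / sum(wi), lambda, w,
                SIMPLIFY=FALSE))

## lambda* maximizes ESS among convex combinations; compare with
## the naive lambda_i = W_i/W and a uniform allocation
W <- sapply(w, sum)
ESS(wlambda(lambda.star))     ## largest of the three
ESS(wlambda(W/sum(W)))        ## naive; typically much smaller
ESS(wlambda(rep(1/3, 3)))     ## uniform
```

In {\tt tgp} itself the analogous combined weights are computed internally and exposed via the trace (e.g., {\tt \$trace\$post\$wlambda}), as illustrated in the examples below.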
$\mathrm{ESS}(\mathbf{w}^{\lambda^*})$ depends on the $\mathrm{ESS}(\mathbf{w}_i)$, which in turn depend on the $k_i$.  Smaller $k_i$ will lead to better mixing in the Markov chain, but lower $\mathrm{ESS}(\mathbf{w}_i)$.  Therefore, we can expect that the geometric and sigmoidal ladders will fare better than the harmonic ones, so long as the desired improvements in mixing are achieved.  In the examples to follow, we shall see that the sigmoidal ladder does indeed lead to higher $\mathrm{ESS}(\mathbf{w}^{\lambda^*})$.

\subsection{Examples}
\label{sec:examples}

Here the IT method is shown in action for {\tt tgp} models.  IT is controlled in the {\tt b*} functions via the {\tt itemps} argument: a {\tt data.frame} coinciding with the output of the {\tt default.itemps} function.  The {\tt lambda} argument to {\tt default.itemps} can be used to base posterior predictive inference on the other IT heuristics: ST and the na\"ive approach (\ref{Eq:hath}).  Whenever the argument {\tt m = 1} is used with {\tt k.min != 1}, the resulting estimator is constructed via tempered importance sampling at the single inverse temperature {\tt k.min}, in the style of Jennison~\cite{jennison:1993} as outlined in Section \ref{sec:temp}.  The parameters $c_0$ and $n_0$ for stochastic approximation of the pseudo--prior can be specified as a 2--vector {\tt c0n0} argument to {\tt default.itemps}.  In the examples which follow we simply use the default configuration of the IT method, adjusting only the minimum inverse temperature via the {\tt k.min} argument.

Before delving into more involved examples, we illustrate the stages involved in a small run of importance tempering (IT) on the exponential data from Section 3.3 of \cite{gramacy:2007}.  The data can be obtained as:
<<>>=
exp2d.data <- exp2d.rand()
X <- exp2d.data$X
Z <- exp2d.data$Z
@ 
Now, consider applying IT to the Bayesian treed LM with a small geometric ladder.  A warning will be given if the default setting of \verb!bprior="bflat"!
is used, as this (numerically) improper prior can lead to improper posterior inference at high temperatures.
<<>>=
its <- default.itemps(m=10)
exp.btlm <- btlm(X=X, Z=Z, bprior="b0", R=2, itemps=its,
                 pred.n=FALSE, BTE=c(1000,3000,2))
@ 
Notice how the MCMC inference procedure starts with $B+T=\Sexpr{exp.btlm$BTE[1] + exp.btlm$BTE[2]}$ rounds of stochastic approximation (initial adjustment of the pseudo--prior) in place of the typical (default) $B=\Sexpr{exp.btlm$BTE[1]}$ burn--in rounds.  Then, the first round of sampling from the posterior commences, over $T=\Sexpr{exp.btlm$BTE[2]-exp.btlm$BTE[1]}$ rounds, during which the observation counts in each temperature are tallied.  The progress meter shows the current temperature the chain is in, say {\tt k=0.629961}, after each of 1000 sampling rounds.  The first repeat starts with a pseudo--prior that has been adjusted by the observation counts, which continue to be accumulated throughout the entire procedure (i.e., they are never reset).  Any subsequent repeats begin after a similar (re-)adjustment.

Before finishing, the routine summarizes the sample size and effective sample sizes in each rung of the temperature ladder.  The number of samples is given by {\tt len}, and the ESS by {\tt ess}.  These quantities can also be recovered via {\tt traces}, as shown later.  The ESS of the optimally combined IT sample is the last quantity printed.  This, along with the ESS and total numbers of samples in each temperature, can also be obtained via the {\tt tgp}-class output object.
<<>>=
exp.btlm$ess
@ 

\subsubsection{Motorcycle accident data}
\label{sec:moto}

Recall the motorcycle accident data of Section 3.4 of the first {\tt tgp} vignette \cite{gramacy:2007}.  Consider using IT to sample from the posterior distribution of the treed GP LLM model using the geometric temperature ladder.
<<>>=
library(MASS)
moto.it <- btgpllm(X=mcycle[,1], Z=mcycle[,2], BTE=c(2000,52000,10),
                   bprior="b0", R=3, itemps=geo, trace=TRUE,
                   pred.n=FALSE, verb=0)
@ 
Out of a total of $\Sexpr{moto.it$R*moto.it$BTE[2]/moto.it$BTE[3]}$ samples from the joint chain, the resulting (optimally combined) ESS was:
<<>>=
moto.it$ess$combined
@ 
Alternatively, $\mb{w}^{\lambda^*}$ can be extracted from the traces and used to make the ESS calculation directly.
<<>>=
p <- moto.it$trace$post
ESS(p$wlambda)
@ 
The unadjusted weights $\mb{w}$ are also available from {\tt trace}.  We can see that the na\"{i}ve choice of $\lambda_i = W_i/W$, leading to the estimator in (\ref{Eq:hath}), has a clearly inferior effective sample size.
<<>>=
ESS(p$w)
@ 
To see the benefit of IT over ST we can simply count the number of samples obtained when $k^{(t)} = 1$.  This can be accomplished in several ways: either via the traces or through the output object.
<<>>=
as.numeric(c(sum(p$itemp == 1), moto.it$ess$each[1,2:3]))
@ 
That is, (optimal) IT gives effectively $\Sexpr{signif(moto.it$ess$combined/sum(p$itemp==1), 3)}$ times more samples.  The na\"{i}ve combination, leading to the estimator in (\ref{Eq:hath}), yields an estimator with an effective sample size that is $\Sexpr{round(100*ESS(p$w)/sum(p$itemp==1))}$\% of the number of samples obtained under ST.

Now we would like to compare to the MCMC samples obtained under the same model, without IT.
<<>>=
moto.reg <- btgpllm(X=mcycle[,1], Z=mcycle[,2], BTE=c(2000,52000,10),
                    R=3, bprior="b0", trace=TRUE, pred.n=FALSE, verb=0)
@ 
The easiest comparison to make is to look at the heights explored under the three chains: the regular one, the chain of heights visited at all temperatures (combined), and those obtained after applying IT via re-weighting under the optimal combination $\lambda^*$.
<<>>=
L <- length(p$height)
hw <- suppressWarnings(sample(p$height, L, prob=p$wlambda, replace=TRUE))
b <- hist2bar(cbind(moto.reg$trace$post$height, p$height, hw))
@ 
\begin{figure}[ht!]
<>=
barplot(b, beside=TRUE, col=1:3, xlab="tree height", ylab="counts",
        main="tree heights encountered")
legend("topright", c("reg MCMC", "All Temps", "IT"), fill=1:3)
@ 
<>=
graphics.off()
@ 
\centering
\includegraphics[trim=0 25 0 25]{tgp2-it-moto-height}
\caption{Barplots indicating the counts of the number of times the Markov chains (for regular MCMC, combining all temperatures in the inverse temperature ladder, and those re-weighted via IT) were in trees of various heights for the motorcycle data.}
\label{f:moto:it:heights}
\end{figure}
Figure \ref{f:moto:it:heights} shows barplots indicating the count of the number of times the Markov chains were in trees of various heights after burn--in.  Notice how the tempered chain (denoted ``All Temps'' in the figure) frequently visits trees of height one, whereas the non--tempered chain (denoted ``reg MCMC'') never does.  The result is that the non--tempered chain underestimates the probability of height two trees and produces a corresponding overestimate of height four trees---which are clearly not supported by the data---even visiting trees of height five.  The IT estimator appropriately down--weights height one trees and provides correspondingly more realistic estimates of the probability of height two and four trees.

Whenever introducing another parameter into the model, like the inverse temperature $k$, it is important to check that the marginal posterior chain for that parameter is mixing well.  For ST it is crucial that the chain makes rapid excursions between the cold and hottest temperatures, and visits each temperature roughly the same number of times.
\begin{figure}[ht!]
<>=
plot(log(moto.it$trace$post$itemp), type="l", ylab="log(k)", xlab="samples",
     main="trace of log(k)")
@ 
<>=
graphics.off()
@ 
\centering
\includegraphics[trim=0 25 0 25]{tgp2-it-moto-ktrace}
\caption{A trace of the MCMC samples from the marginal posterior distribution of the inverse temperature parameter, $k$, in the motorcycle experiment}
\label{f:ktrace}
\end{figure}
Figure \ref{f:ktrace} shows a trace of the posterior samples for $k$ in the motorcycle experiment.  Arguably, the mixing in $k$--space leaves something to be desired.  Since it can be very difficult to tune the pseudo--prior and MH proposal mechanism to get good mixing in $k$--space, it is fortunate that the IT methodology does not rely on the same mixing properties as ST does.  Since samples can be obtained from the posterior distribution of the parameters of interest by re-weighting samples obtained when $k < 1$, it is only important that the chain frequently visit temperatures near $k=1$ to obtain good sampling, and high temperatures (low $k$) to obtain good mixing; the actual time spent exactly at $k=1$ is less important.
\begin{figure}[ht!]
<>=
b <- itemps.barplot(moto.it, plot.it=FALSE)
barplot(t(cbind(moto.it$itemps$counts, b)), col=1:2, beside=TRUE,
        ylab="counts", xlab="itemps", main="inv-temp observation counts")
legend("topleft", c("observation counts", "posterior samples"), fill=1:2)
@ 
<>=
graphics.off()
@ 
\centering
\includegraphics[trim=0 25 0 25]{tgp2-it-moto-khist}
\caption{Comparing (thinned) samples from the posterior distribution for the inverse temperature parameter, $k$, (posterior samples), to the observation counts used to update the pseudo--prior, in the motorcycle experiment}
\label{f:khist}
\end{figure}
Figure \ref{f:khist} shows the histogram of the inverse temperatures visited in the Markov chain for the motorcycle experiment.
Also plotted is a histogram of the {\em observation counts} in each temperature.  The two histograms should have similar shape but different totals.  Observation counts are tallied during every MCMC sample after burn--in, whereas the posterior samples of $k$ are thinned (at a rate specified in {\tt BTE[3]}).  When the default {\tt trace=FALSE} argument is used, only the observation counts will be available in the {\tt tgp}--class object, and these can be used as a surrogate for a trace of $k$.

The compromise IT approach obtained using the sigmoidal ladder can yield an increase in ESS.
<<>>=
moto.it.sig <- btgpllm(X=mcycle[,1], Z=mcycle[,2], BTE=c(2000,52000,10),
                       R=3, bprior="b0", krige=FALSE, itemps=sig, verb=0)
@ 
Compare the resulting ESS to the one given for the geometric ladder above.
<<>>=
moto.it.sig$ess$combined
@ 
\begin{figure}[ht!]
<>=
plot(moto.it.sig)
@ 
<>=
graphics.off()
@ 
\centering
\includegraphics[trim=0 25 0 25]{tgp2-it-moto-pred}
\caption{Posterior predictive surface for the motorcycle data, with 90\% quantile errorbars, obtained under IT with the sigmoidal ladder.}
\label{f:moto:pred}
\end{figure}
Plots of the resulting predictive surface are shown in Figure \ref{f:moto:pred} for comparison with those in Section 1.1 of the first {\tt tgp} vignette \cite{gramacy:2007}.  In particular, observe that the transition from the middle region to the right one is much less stark in this tempered version than in the original---which very likely spent a disproportionate amount of time stuck in a posterior mode with trees of depth three or greater.

\subsubsection{Synthetic 2--d Exponential Data}
\label{sec:exp}

Recall the synthetic 2--d exponential data of Section 3.4 of the {\tt tgp} vignette \cite{gramacy:2007}, where the true response is given by
\[
z(\mb{x}) = x_1 \exp(-x_1^2 - x_2^2).
\]
Here we take $\mb{x} \in [-6,6]\times [-6,6]$ with a $D$--optimal
design.
<<>>=
Xcand <- lhs(10000, rbind(c(-6,6),c(-6,6)))
X <- dopt.gp(400, X=NULL, Xcand)$XX
Z <- exp2d.Z(X)$Z
@
Consider a treed GP LLM model fit to this data using standard MCMC.
<<>>=
exp.reg <- btgpllm(X=X, Z=Z, BTE=c(2000,52000,10), bprior="b0",
                   trace=TRUE, krige=FALSE, R=10, verb=0)
@

\begin{figure}[ht!]
<<it-exp-pred,fig=TRUE,include=FALSE>>=
plot(exp.reg)
@
<<echo=FALSE,results=hide>>=
graphics.off()
@
\centering
\includegraphics[trim=0 25 0 25]{tgp2-it-exp-pred}
\caption{Posterior predictive surface for the 2--d exponential data:
mean surface {\em (left)} and 90\% quantile difference {\em (right)}.}
\label{f:exp:pred}
\end{figure}

Figure \ref{f:exp:pred} shows the resulting posterior predictive
surface.  The maximum {\em a posteriori} (MAP) tree is drawn over the
error surface in the {\em right--hand} plot.  The height of this tree
can be obtained from the {\tt tgp}--class object.
<<>>=
h <- exp.reg$post$height[which.max(exp.reg$posts$lpost)]
h
@
It is easy to see that many fewer partitions are actually necessary to
separate the interesting central region from the surrounding flat
region.

\begin{figure}[ht!]
<<it-exp-mapt,fig=TRUE,include=FALSE>>=
tgp.trees(exp.reg, "map")
@
<<echo=FALSE,results=hide>>=
graphics.off()
@
\centering
\includegraphics[trim=0 100 0 25]{tgp2-it-exp-mapt}
\caption{Diagrammatic depiction of the maximum {\em a posteriori} (MAP)
tree for the 2--d exponential data under standard MCMC sampling.}
\label{f:exp:mapt}
\end{figure}

Figure \ref{f:exp:mapt} shows a diagrammatic representation of the MAP
tree.  Given the apparent over--partitioning in this height--\Sexpr{h}
tree, it would be surprising to find much posterior support for trees of
greater height.  One might indeed suspect that there are trees with
fewer partitions but higher posterior probability, and thus guess that
the Markov chain for the trees plotted in these figures became stuck in
a local mode of tree space while on an excursion into deeper trees.  Now
consider using IT.
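Before doing so, it may help to recall the mechanics of the
re-weighting, sketched here in generic notation (which may differ
slightly from that used elsewhere in the references).  In the common
formulation where only the likelihood is tempered, a sample $\theta$
obtained at inverse temperature $k$ targets the tempered posterior
$\pi_k(\theta) \propto p(Z|\theta)^k\, \pi(\theta)$, and so may be
treated as an importance sample for the true posterior ($k=1$) with
weight
\[
w(\theta) = \frac{\pi_1(\theta)}{\pi_k(\theta)} \propto
p(Z|\theta)^{1-k},
\]
so that samples from hot rungs ($k$ near zero) are down-weighted
relative to samples near the cold target.  The quality of $n$ such
weighted samples is commonly summarized by an effective sample size of
the form
\[
\mr{ESS}(\mb{w}) = \frac{n}{1 + \mr{cv}^2(\mb{w})} =
\frac{\left(\sum_{i=1}^n w_i\right)^2}{\sum_{i=1}^n w_i^2},
\]
where $\mr{cv}(\mb{w})$ is the coefficient of variation of the weights;
this is the flavor of ESS summarized by the {\tt \$ess} fields compared
above and below.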
It will be important in this case to have a $k_{\mbox{\tiny min}}$ small
enough to ensure that the tree occasionally prunes back to the root, so
we shall use a smaller $k_{\mbox{\tiny min}}$ than the default.
Generally speaking, some pilot tuning may be necessary to choose an
appropriate $k_{\mbox{\tiny min}}$ and number of rungs $m$, although the
defaults should give adequate performance in most cases.
<<>>=
its <- default.itemps(k.min=0.02)
exp.it <- btgpllm(X=X, Z=Z, BTE=c(2000,52000,10), bprior="b0",
                  trace=TRUE, krige=FALSE, itemps=its, R=10, verb=0)
@
As expected, the tempered chain moves more rapidly through tree space by
accepting more tree proposals.  The acceptance rates of the tree
operations can be accessed from the {\tt tgp}--class object.
<<>>=
exp.it$gpcs
exp.reg$gpcs
@
The increased rate of {\em prune} operations explains how the tempered
distributions helped the chain escape the local modes of deep trees.  We
can quickly compare the effective sample sizes of the three possible
estimators: ST, na\"{i}ve IT, and optimal IT.
<<>>=
p <- exp.it$trace$post
data.frame(ST=sum(p$itemp == 1), nIT=ESS(p$w), oIT=exp.it$ess$combined)
@
Due to the thinning in the Markov chain ({\tt BTE[3] = 10}) and the
traversal between the $m=10$ temperatures in the ladder, we can be
reasonably certain that the \Sexpr{round(exp.it$ess$combined)} samples
obtained via IT, out of the total of
\Sexpr{round(exp.it$R*(exp.it$BTE[2]-exp.it$BTE[1])/exp.it$BTE[3])}
samples from the posterior, are far less correlated than the ones
obtained via standard MCMC.  As with the motorcycle data, we can compare
the tree heights visited by the two chains.
<<>>=
L <- length(p$height)
hw <- suppressWarnings(sample(p$height, L, prob=p$wlambda, replace=TRUE))
b <- hist2bar(cbind(exp.reg$trace$post$height, p$height, hw))
@
\begin{figure}[ht!]
<<it-exp-height,fig=TRUE,include=FALSE>>=
barplot(b, beside=TRUE, col=1:3, xlab="tree height", ylab="counts",
        main="tree heights encountered")
legend("topright", c("reg MCMC", "All Temps", "IT"), fill=1:3)
@
<<echo=FALSE,results=hide>>=
graphics.off()
@
\centering
\includegraphics[trim=0 25 0 25]{tgp2-it-exp-height}
\caption{Barplot indicating the number of times the Markov chains (for
regular MCMC, for all temperatures in the inverse temperature ladder
combined, and as re-weighted via IT) visited trees of various heights on
the 2--d exponential data.}
\label{f:exp:it:heights}
\end{figure}

Figure \ref{f:exp:it:heights} shows a barplot of {\tt b}, which
illustrates that the tempered chain frequently visited shallow trees.
IT with the optimal weights shows that the standard MCMC chain missed
many trees of height three and four with considerable posterior support.

\begin{figure}[ht!]
<<it-exp-trace-height,fig=TRUE,include=FALSE>>=
ylim <- range(p$height, exp.reg$trace$post$height)
plot(p$height, type="l", main="trace of tree heights",
     xlab="t", ylab="height", ylim=ylim)
lines(exp.reg$trace$post$height, col=2)
legend("topright", c("tempered", "reg MCMC"), lty=c(1,1), col=1:2)
@
<<echo=FALSE,results=hide>>=
graphics.off()
@
\centering
\includegraphics[trim=0 25 0 25]{tgp2-it-exp-trace-height}
\caption{Traces of the tree heights obtained under the two Markov chains
(regular MCMC, and all temperatures in the inverse temperature ladder
combined) on the 2--d exponential data.}
\label{f:exp:trace:height}
\end{figure}

To compare more directly the mixing in tree space between the ST and
tempered chains, consider the trace plots of the heights of the trees
explored by the chains, shown in Figure \ref{f:exp:trace:height}.
Despite being restarted \Sexpr{exp.reg$R} times, the regular MCMC chain
(almost) never visits trees of height less than five after burn--in,
instead making rather lengthy excursions into deeper trees while
exploring a local mode of the posterior.
In contrast, the tempered chain frequently prunes back to the tree root,
and consequently discovers posterior modes in tree heights three and
four.

\begin{figure}[ht!]
<<it-expit-pred,fig=TRUE,include=FALSE>>=
plot(exp.it)
@
\vspace{-0.7cm}
<<it-expit-trees,fig=TRUE,include=FALSE>>=
tgp.trees(exp.it, "map")
@
<<echo=FALSE,results=hide>>=
graphics.off()
@
\centering
\includegraphics[trim=0 15 0 0]{tgp2-it-expit-pred}
\includegraphics[trim=0 100 0 0]{tgp2-it-expit-trees}
\caption{2--d exponential data fit with IT.  {\em Top:} posterior
predictive mean surface, with the MAP tree overlaid.  {\em Bottom:}
diagrammatic representation of the MAP tree.}
\label{f:exp-it:pred}
\end{figure}

To conclude, a plot of the posterior predictive surface is given in
Figure \ref{f:exp-it:pred}, where the MAP tree is shown both graphically
and diagrammatically.

\subsection*{Acknowledgments}
This work was partially supported by research subaward
08008-002-011-000 from the Universities Space Research Association and
NASA, NASA/University Affiliated Research Center grant SC 2003028
NAS2-03144, Sandia National Laboratories grant 496420, National Science
Foundation grants DMS 0233710 and 0504851, and Engineering and Physical
Sciences Research Council grant EP/D065704/1.  The authors would like to
thank their Ph.D.~advisor, Herbie Lee, whose contributions and guidance
in this project have been invaluable throughout.  Finally, we would like
to thank two anonymous referees whose many helpful comments improved the
paper.

\bibliography{tgp}
\bibliographystyle{plain}

\end{document}