Introduction

This note gives an overview of some of the algorithms used in Bayesian statistics. We begin with the two fundamental components of a Markov chain Monte Carlo (MCMC) process: Markov chains and Monte Carlo methods. We then use these tools to understand how the Metropolis-Hastings and Hamiltonian Monte Carlo algorithms explore the posterior distribution.

Random Walks

A random walk is a special case of a Markov chain. It does precisely what the name suggests: it moves randomly through some mathematical space. The canonical example is a particle moving in one dimension. At time \(t\) the particle moves up or down with equal probability relative to its position at time \(t-1\), a behavior that can be represented with a Bernoulli probability mass function. Another example is to let \(\varepsilon \sim \mathcal{N}(0,1)\) and define \[ X_n = X_{n-1} + \varepsilon \] which results in a continuous one-dimensional random walk.
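As a quick illustration, the continuous version above can be simulated by accumulating normal increments (a minimal base R sketch):

# CONTINUOUS RANDOM WALK (one dimension)
n <- 1e3
x <- cumsum(rnorm(n, 0, 1))  # X_n = X_{n-1} + epsilon, starting from X_0 = 0
plot(x, type = "l", xlab = "n", ylab = "X_n")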

The code below computes a discrete random walk in two dimensions with equal probability assigned to each movement: up, down, left, or right. It uses the sample() function, but rmultinom() could have been used instead by defining \((1,2,3,4) \equiv (\mbox{up, down, left, right})\).

# RANDOM WALK

loc <- c("up", "down", "left", "right")
dat <- matrix(NA, ncol = 2, nrow = 1e5/2)
dat[1,] <- c(0,0)

for(i in 2:nrow(dat)) {
  rand <- sample(loc, 1, prob = rep(0.25,4))
  if(rand == "up") {
    dat[i,] <- c(0,1) + dat[i-1,]
  }
  else if(rand == "down") {
    dat[i,] <- c(0,-1) + dat[i-1,]
  }
  else if(rand == "left") {
    dat[i,] <- c(1,0) + dat[i-1,]
  }
  else if(rand == "right") {
    dat[i,] <- c(-1,0) + dat[i-1,]
  }
  else {
    dat[i,] <- c(NA,NA)
  }
}

Below, the figure on the left plots the first 5,000 steps of the random walk, and the figure on the right plots all 50,000 steps of the walk with the first 5,000 steps highlighted. The darker regions indicate the parts of the process that walked over itself.

Markov Chains

Consider a sequence of random variables \(X_0, X_1, X_2, \ldots\) where \(X_n\) denotes the state at time \(n \geq 0\). If the sequence is a Markov chain then the probability of \(X_n\) depends only on \(X_{n-1}\). Thus, we can write the conditional probability as, \[ P(X_{n} = j | X_{n-1} = i) \] where \(i\) and \(j\) are values from the state space. The conditional probability above is known as the transition probability of moving from state \(i\) at time \(n-1\) to state \(j\) at time \(n\).

The mathematical statement above is best summarized by the Markov property, which states that of all the past values \(X_{n-1}, X_{n-2}, X_{n-3}, \ldots, X_0\), only the most recent term \(X_{n-1}\) influences the prediction of the future state \(X_n\). This also means that \(X_n\) and \(X_{n-2}, X_{n-3}, \ldots, X_0\) are conditionally independent given \(X_{n-1}\). One useful feature of Markov chains is that they save us from having to use the entire history of the chain to predict the next state.

To simulate a Markov chain we need to compute the probability of moving from one state to the next. These probabilities can be encoded in a transition matrix, the \((i,j)\) element of which defines the probability of moving to state \(j\) from state \(i\) for all \(i,j = 1, 2, \ldots, K\). A transition matrix is square since, given the present state, it must describe the probability of moving to every state. Additionally, the moves are disjoint events (e.g. if our states are directions then it is not possible to move both up and down), and together they exhaust the sample space, so the rows of the transition matrix sum to one.

Now we want to describe how the chain actually moves from step \(n-1\) to step \(n\). Let the \(1 \times K\) vector \(\vec{v}_n\) denote the state probabilities at time \(n\) and let \(M\) denote the \(K \times K\) transition matrix. Then the marginal distribution of each state is given by applying the law of total probability in the following way, \[ \vec{v}_{n} = \vec{v}_{n-1} M \] Each element \(j\) of \(\vec{v}_n\) is equal to \(\sum_{i=1}^{K} (\vec{v}_{n-1})_i \cdot M_{i,j}\). In other words, the \(j^{th}\) element of \(\vec{v}_n\) is the dot product of \(\vec{v}_{n-1}\) and the \(j^{th}\) column of \(M\).

Intuitively this makes sense. We start with a vector of probabilities describing the previous state and a transition matrix describing the move to the next state. Each column \(j\) of the transition matrix is a \(K\)-vector of probabilities of moving to state \(j\) from each of the states \(i = 1, 2, \ldots, K\) the chain could previously have occupied. By the law of total probability, summing the product of the probability of being at each previous state \(i\) and the associated probability of moving from \(i\) to \(j\) (the \((i,j)\) element of the transition matrix) gives the marginal probability of moving to state \(j\).
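To make the update concrete, here is a minimal two-state sketch (the matrix values are invented for illustration):

# ONE STEP OF v_n = v_{n-1} M (two-state illustration)
M <- rbind(c(0.7, 0.3),
           c(0.4, 0.6))
v0 <- c(1, 0)     # start in state 1 with certainty
v1 <- v0 %*% M    # marginal distribution after one step: (0.7, 0.3)
v2 <- v1 %*% M    # after two steps: (0.61, 0.39)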

Similar to the random walk example above, we simulate two Markov chains moving through two dimensional space with the movements: up, down, left, and right.

The first Markov chain emulates a random walk and uses the following transition matrix, \[ M = \begin{bmatrix} 0.25 & 0.25 & 0.25 & 0.25 \\ 0.25 & 0.25 & 0.25 & 0.25 \\ 0.25 & 0.25 & 0.25 & 0.25 \\ 0.25 & 0.25 & 0.25 & 0.25 \\ \end{bmatrix} \]

where each row of \(M\) gives the probabilities of moving to each new state \(j \in (\mbox{up, down, left, right})\) given state \(i\) at time \(n-1\), and each column gives the probabilities of arriving at state \(j\) from each possible previous state. This Markov chain behaves as a random walk since equal weight is put on moving from state \(i\) to state \(j\) for all \(j \in (\mbox{up, down, left, right})\). For example, given that the chain’s previous state was up, the probability of moving up is \(M_{1,1} = 0.25\), which equals the probability of moving down \(M_{1,2} = 0.25\) (and likewise the probabilities of moving left or right).

The second Markov chain uses the following transition matrix, \[ M = \begin{bmatrix} 0 & 0.9 & 0 & 0.1\\ 0.9 & 0 & 0.1 & 0\\ 0.1 & 0 & 0 & 0.9\\ 0 & 0.1 & 0.9 & 0\\ \end{bmatrix} \] This type of transition matrix encourages the Markov chain to move in the direction opposite to its previous state. For example, if our previous state was moving up then (going across the top row of \(M\)) the next state is up with probability 0, down with probability 0.9, left with probability 0, and right with probability 0.1. From this point we will most likely move down. If so then (going across the second row of \(M\)) we move back up with probability 0.9 or left with probability 0.1. But, given the previous state of moving up, there is a 0.1 chance that we move right. Then (going across the last row) we would move up with probability 0, down with probability 0.1, left with probability 0.9, and right with probability 0.
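Given the current state, a single transition just samples the next state from the corresponding row of \(M\). For example (a one-off draw, with the matrix rebuilt here so the snippet is self-contained):

# ONE TRANSITION, CONDITIONAL ON THE CURRENT STATE
M <- rbind(c(0, 0.9, 0, 0.1),
           c(0.9, 0, 0.1, 0),
           c(0.1, 0, 0, 0.9),
           c(0, 0.1, 0.9, 0))
rownames(M) <- colnames(M) <- c("up", "down", "left", "right")
sample(colnames(M), 1, prob = M["up",])  # most likely "down"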

The function that we use to run the Markov chain through is provided below.

# MARKOV CHAIN FUNCTION

#' Construct a Markov chain governing movement through 2D space
#' @param transition_matrix A square matrix of transition probabilities. Rows must sum to one.
#' @param iter Number of steps to take over the 2D space.
#' @param init Initial position in the 2D space.
#' @param state Initial state probabilities to start the Markov chain.
#' @return The probabilities (prob) used to calculate each movement (move) and the data (dat) that
#' describes the movement over the 2D space.
markov_chain <- function(transition_matrix, iter = 1e3, init = c(0,0), state = c(1,0,0,0)) {
  # setup containers
  dat <- array(NA, c(iter, 2))  
  prob <- array(NA, c(iter,4))
  move <- array(NA, iter)
  # initial parameterizations
  loc <- colnames(transition_matrix)
  dat[1,] <- init
  # loop to construct the chain
  for(i in 2:iter) {
    state <- state %*% transition_matrix  # propagate the marginal distribution v_n = v_{n-1} M
    rand <- sample(loc, 1, prob = state)

    prob[i,] <- c(state)
    move[i] <- c(rand)

    if(rand == "up") {
      dat[i,] <- c(0,1) + dat[i-1,]
    }
    else if(rand == "down") {
      dat[i,] <- c(0,-1) + dat[i-1,]
    }
    else if(rand == "left") {
      dat[i,] <- c(-1,0) + dat[i-1,]
    }
    else if(rand == "right") {
      dat[i,] <- c(1,0) + dat[i-1,]
    }
    else {
      dat[i,] <- c(NA,NA)
    }
  }
  # return containers as a list
  return(list("dat" = dat, "prob" = prob, "move" = move))
}

Now we run the simulations using each transition matrix with the initial state probabilities \(\vec{v}_0 = (1, 0, 0, 0)\).

# MARKOV CHAIN (as a random walk)
trans_prob <- matrix(0.25, ncol = 4, nrow = 4)
colnames(trans_prob) <- row.names(trans_prob) <- c("up", "down", "left", "right")
iter <- 1e5
mc_rw <- markov_chain(trans_prob, iter = iter)

# MARKOV CHAIN (with a negating transition matrix)
trans_prob <- rbind(c(0, 0.9, 0, 0.1),
                    c(0.9, 0, 0.1, 0),
                    c(0.1, 0, 0, 0.9),
                    c(0, 0.1, 0.9, 0))
colnames(trans_prob) <- row.names(trans_prob) <- c("up", "down", "left", "right")
mc <- markov_chain(trans_prob, iter = iter)

The plot below overlays the “negating” Markov chain with the random walk Markov chain.

Below are the probabilities and the associated moves (drawn with the sample() function) for the first 10 steps of the negating Markov chain.

##     x  y        up      down       left      right  move
## 1   0  0        NA        NA         NA         NA  <NA>
## 2   0 -1 0.0000000 0.9000000 0.00000000 0.10000000  down
## 3   0  0 0.8100000 0.0100000 0.18000000 0.00000000    up
## 4   0 -1 0.0270000 0.7290000 0.00100000 0.24300000  down
## 5   0  0 0.6562000 0.0486000 0.29160000 0.00360000    up
## 6   0 -1 0.0729000 0.5909400 0.00810000 0.32806000  down
## 7  -1 -1 0.5326560 0.0984160 0.35434800 0.01458000  left
## 8  -1  0 0.1240092 0.4808484 0.02296360 0.37217880    up
## 9  -1 -1 0.4350599 0.1488262 0.38304576 0.03306816  down
## 10  0 -1 0.1722481 0.3948607 0.04464396 0.38824718 right
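Although the probabilities oscillate early on, repeatedly applying \(\vec{v}_n = \vec{v}_{n-1} M\) settles into the chain’s stationary distribution. Since this transition matrix happens to be doubly stochastic (its columns also sum to one), the stationary distribution should be uniform; here is a quick numerical check, assuming trans_prob still holds the negating matrix from above:

# STATIONARY DISTRIBUTION CHECK
v <- c(1, 0, 0, 0)
for(i in 1:1000) {
  v <- v %*% trans_prob  # repeatedly apply v_n = v_{n-1} M
}
round(c(v), 3)           # approximately (0.25, 0.25, 0.25, 0.25)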

Monte Carlo Methods

Monte Carlo methods apply random sampling to numerically solve a problem (e.g. computing the area under a curve). This is especially useful when the solution is not analytically tractable. An overview of the approach is outlined below,

ALGORITHM (Monte Carlo Method)

  1. Define the domain of the function’s inputs.
  2. Sample from the domain according to some probability distribution (e.g. uniform).
  3. Perform a deterministic computation using (2) as an input.
  4. Accept or reject the value proposed in (2) based on some criterion that involves the value from (3).
  5. Aggregate the results.

We apply this method below to approximate \(\pi\) and in rejection sampling.

Estimating \(\pi\)

Using Monte Carlo methods we can estimate the mathematical constant \(\pi\), which is defined as the ratio of a circle’s circumference to its diameter. Consider a unit circle (i.e. a circle with radius \(r=1\)) inscribed in a square. The area of the circle is \(\pi r^2 = \pi\) and the area of the square is 4, so the ratio of the area of the circle to the area of the square is \(\pi/4\). Now all we need to do is construct a Monte Carlo method to estimate this ratio and multiply the result by 4 to get an estimate of \(\pi\).

  1. Define the domain as \(x\in[0,1]\) and the range as \(y\in[0,1]\).
  2. Sample \(x\) and \(y\) with \(x,y\sim Unif(0,1)\).
  3. Compute \(\alpha = \sqrt{r^2 - x^2}\), where \(r = 1\) is the radius of the circle.
  4. If \(y \leq \alpha\) then accept, else reject.
  5. Compute the ratio \(N_{accept}/(N_{accept} + N_{reject})\), where \(N_{accept}\) and \(N_{reject}\) are the total number of accepted and rejected values, respectively.
  6. Multiply (5) by 4 to get an estimate of \(\pi\).

Below is the code that runs through the steps above to estimate \(\pi\).

circle <- function(x) {
  sqrt(1 - x^2)
}

domain <- seq(0,1, length.out = 1e3)  # grid used to draw the quarter circle in the figure
iter <- 1e4 * 2
accept <- matrix(NA, ncol = 2, nrow = iter)
reject <- matrix(NA, ncol = 2, nrow = iter)
for(i in 1:iter) {
  x <- runif(1, 0, 1)
  y <- runif(1, 0, 1)
  if(x^2 + y^2 <= 1)
    accept[i,] <- c(x,y)
  else
    reject[i,] <- c(x,y)
}
cat('estimate of pi = ', (nrow(na.omit(accept)) / iter) * 4)
## estimate of pi =  3.137

The estimate is close, but not without some numerical error. The figure below illustrates the accepted (pink) and rejected (grey) values as well as the deterministic function that defines a circle.
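The size of that error is predictable. Each draw is a Bernoulli trial with success probability \(p = \pi/4\), so the standard error of the estimate is \(4\sqrt{p(1-p)/n}\). A rough sketch of that calculation, reusing the objects from the code above:

# APPROXIMATE MONTE CARLO STANDARD ERROR
p_hat <- nrow(na.omit(accept)) / iter  # estimated acceptance probability
4 * sqrt(p_hat * (1 - p_hat) / iter)   # roughly 0.01 with 20,000 draws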

Rejection Sampling

Rejection sampling is a Monte Carlo method that lets you sample from a probability distribution using only its probability density/mass function and the uniform distribution. Let \(f\) be the target distribution (i.e. the distribution you want samples from) and let \(g\) be the candidate distribution (i.e. the distribution you use to randomly sample from the domain of interest). We need these distributions to satisfy two conditions: 1) both \(g\) and \(f\) must be defined over the same support, and 2) there must be some constant \(M\) such that \(f(x)/g(x) \leq M\) for all values \(x\) in the support. Here, \(M\) is interpreted as a bound on the likelihood ratio. The algorithm proceeds as follows,

ALGORITHM (Rejection Sampling)

  1. Generate \(y\) from the proposal distribution \(g\).
  2. Calculate the likelihood ratio \(\alpha = f(y)/(Mg(y))\).
  3. Generate a stochastic condition \(u \sim Unif(0,1)\).
  4. If \(\alpha > u\) accept \(y\), else reject.

# MONTE CARLO - REJECTION SAMPLING

#' A rejection sampling algorithm where proposals are generated uniformly over the support.
#' @param f Target distribution.
#' @param g Candidate distribution.
#' @param support Support of the target distribution.
#' @param M Bound on the likelihood ratio f/g.
#' @param iter Number of iterations to run the algorithm through (iter = accept + reject).
#' @return A list containing accepted/rejected values.
accept_reject <- function(f, g, support = c(0,1), M = 1, iter = 1e3) {
  out <- list()

  for (i in 1:iter) {
    y <- runif(1, support[1], support[2])  # proposal from g
    u <- runif(1, 0, 1)                    # stochastic condition
    out$y[i] <- y
    out$u[i] <- u
    # likelihood ratio
    ratio <- f(y)/(M*g(y, support[1], support[2]))
    # accept-reject condition
    if(ratio > u)
      out$accept[i] <- y
    else
      out$reject[i] <- y
  }
  return(out)
}

The code below applies rejection sampling to sample from the beta distribution using different values of \(M\).

beta_sample1 <- accept_reject(f = function(x){dbeta(x, 2, 2)},
                             g = function(x, a, b){dunif(x, a, b)},
                             M = 1, iter = 1e4)
beta_sample2 <- accept_reject(f = function(x){dbeta(x, 2, 2)},
                              g = function(x, a, b){dunif(x, a, b)},
                              M = 2, iter = 1e4)

The table below describes the total number of accepted (TRUE) and rejected (FALSE) values in each sampling procedure.

##        M=1  M=2
## FALSE 1981 5072
## TRUE  8016 4928

Notice that with a low value of \(M\) we ended up accepting too many values. In fact, we over-accepted in the tails of the beta distribution, making it appear as though the distribution had been squashed from above. This is apparent in the center plot of the figure below, which illustrates the accepted (pink) and rejected (blue) values. With an invalid likelihood ratio bound (e.g. \(M=1\)), the empirical beta distribution over-accepts in the tails. With an adequate value for the bound (e.g. \(M=2\)) we get a better approximation of the true beta distribution. (For comparison, the figure on the far left plots samples from the \(Beta(2,2)\) distribution using the R function rbeta().)

If we state our acceptance rule as \(u \leq f(y)/(Mg(y)) \iff uMg(y) \leq f(y)\) then the reason for this behavior becomes clear. If \(M_1 < M_2\) then for any given \(u\) and \(y\) we have \(uM_1g(y) < uM_2g(y)\), which means that the condition \(uM_1g(y) \leq f(y)\) accepts \(y\) more readily than \(uM_2g(y) \leq f(y)\).
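In this example the smallest valid bound is easy to find. With \(g = Unif(0,1)\) we have \(f(x)/g(x) = f(x)\), and the \(Beta(2,2)\) density peaks at \(x = 0.5\):

# SMALLEST VALID BOUND FOR Beta(2,2) OVER Unif(0,1)
dbeta(0.5, 2, 2)  # 1.5, so any M >= 1.5 satisfies f(x)/g(x) <= M

So \(M = 1\) cannot bound the likelihood ratio, while \(M = 2\) can. The trade-off is efficiency: the overall acceptance probability of the sampler is \(1/M\), so larger bounds waste more proposals.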

Metropolis

Now we want to combine the two aforementioned ideas: Markov chains and Monte Carlo methods. If we use a Markov chain to move through some mathematical space and accept or reject proposed moves according to a Monte Carlo method, then what we have is Markov chain Monte Carlo (MCMC). The Metropolis algorithm is a simple case of this. Again, we have a target distribution \(f\) that we are interested in sampling from and a symmetric proposal (or candidate) distribution \(g\) from which we generate proposals. Let the index \(n\) denote the current position of the chain. The algorithm proceeds as follows,

ALGORITHM (Metropolis)

  1. (Repeat 1-4 J times.) Generate a proposal \(x^* \sim g(x_n)\).
  2. Compute the acceptance ratio \(\alpha = f(x^*)/f(x_n)\).
  3. Accept the proposal \(x^*\) with probability \(\min(1,\alpha)\). If \(\alpha < 1\) then we compare it against a stochastic criterion \(u \sim Unif(0,1)\) to determine whether to accept or reject the proposal.
  4. If \(x^*\) is accepted then \(x_{n+1} = x^*\), else \(x_{n+1} = x_{n}\).

The code below creates a function that runs the Metropolis algorithm. The user can declare the number of iterations and the number of chains. Both inputs dictate whether the algorithm samples appropriately. The number of iterations ensures that we obtain a large enough sample from the distribution, and running chains from different initial values allows us to determine whether the samples converge. If they do not converge then we cannot perform valid inferences on the samples, since they do not adequately represent the distribution. Sometimes this can be remedied by increasing the number of iterations.

# METROPOLIS ALGORITHM

#' The Metropolis algorithm with a normal proposal distribution.
#' @param f Target distribution (log scale).
#' @param iter Number of iterations to run the algorithm through.
#' @param chains Number of chains.
#' @param prop_scale Value of tuning parameter.
#' @return An array of iterations by chains.
metropolis <- function(f, iter = 1e3, chains = 4, prop_scale = 1) {
  init <- rnorm(chains, 0, 20)
  out <- array(NA, c(iter, chains))
  out[1,] <- init

  # define a function to sample (i.e. accept/reject)
  # note: proposals are redrawn until one is accepted
  sampler <- function(x) {
    alpha <- 0
    u <- 1
    while(alpha < u) {
      u <- runif(1, 0, 1)                   # stochastic criterion
      x_star <- rnorm(1, x, 1 * prop_scale) # proposal distribution
      alpha <- exp(f(x_star)-f(x))          # acceptance ratio (f is on the log scale)
    }
    return(x_star)
  }

  # run the sampler
  for(i in 2:iter) {
    for(j in 1:chains) {
      out[i,j] <- sampler(out[i-1,j])       # next state of chain j
    }
  }

  out
}

Before running the algorithm we need to define the posterior distribution that we want to sample from, which is done in the code below. For numerical stability we work with the log of the posterior distribution.

dat <- rnorm(1, rnorm(1, 5, 5), 1)  # a single observation whose mean is drawn from N(5, 5)
post_func <- function(x) {
  lik <- sum(dnorm(dat, x, 1, log = T))
  prior <- dnorm(x, 5, 1, log = T)
  lik + prior
}

Mathematically we can state our model as,

\[ y \sim N(\mu, 1) \\ \mu \sim N(5,1) \]

Below we run the algorithm and look at the mean of the data and the mean of the samples of \(\mu\).

metro <- metropolis(f = post_func, chains = 4, iter = 2e3)
rbind("mean of data" =  mean(dat),
      "mean of met samples" =  mean(metro))
##                         [,1]
## mean of data        6.087906
## mean of met samples 5.467415

It looks like we got closer to the true \(\mu\) than the maximum likelihood estimate of \(\mu\) (which equals the mean of the data).
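This model is also conjugate, so we can check the sampler against the closed-form posterior. With a \(N(5,1)\) prior and unit-variance likelihood, the posterior mean is a precision-weighted average of the prior mean and the data; a quick check, reusing dat and metro from above:

# CONJUGATE POSTERIOR CHECK
n <- length(dat)
(5 + sum(dat)) / (1 + n)  # exact posterior mean; compare with mean(metro)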

In the figure below, the plot on the top left illustrates the full length of all four chains. Although the chains start in disparate locations, they converge fairly quickly. This is not always the case, especially in complicated models. The initial steps are useful only insofar as they eventually take us to the bulk of the distribution’s mass. We should discard these values since they are not exploring the high-mass region of the distribution (the part that we find interesting). It is conventional to treat the first half of each chain as warmup (or burn-in) and conduct inferences on the last half. Next, the plot on the top right presents the first 100 iterations of each chain, and we can see that the chains converge within the first 100 iterations. The plot on the bottom left illustrates the last half of each chain (i.e. the samples on which we can conduct valid inferences). Finally, the plot on the bottom right shows a histogram of the sampled values of \(\mu\).
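In code, discarding the warmup just means dropping the first half of each chain:

# DISCARD WARMUP (first half of each chain)
keep <- (nrow(metro)/2 + 1):nrow(metro)
post_samples <- metro[keep, ]  # samples used for inference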

Metropolis-Hastings

The Metropolis-Hastings algorithm is similar to the Metropolis algorithm, except that we now relax the assumption that the proposal distribution \(g\) must be symmetric. The algorithm proceeds in the same way except that the accept/reject condition becomes \(u < \frac{f(x^*)}{f(x_n)}\frac{g(x_n|x^*)}{g(x^*|x_n)}\). The factor \(\frac{g(x_n|x^*)}{g(x^*|x_n)}\) corrects for the asymmetry in the proposal distribution. This can improve efficiency since, in some cases, an asymmetric proposal distribution can speed up the Markov chain’s exploration of the target distribution.
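As a concrete (hypothetical) illustration of the correction term, suppose proposals were drawn from a log-normal distribution centered at the current state, which is asymmetric on the positive reals. The correction is then the ratio of the two directed proposal densities:

# HASTINGS CORRECTION FOR AN ASYMMETRIC (LOG-NORMAL) PROPOSAL
x_n <- 2
x_star <- rlnorm(1, meanlog = log(x_n), sdlog = 0.5)  # draw from g(. | x_n)
dlnorm(x_n, meanlog = log(x_star), sdlog = 0.5) /
  dlnorm(x_star, meanlog = log(x_n), sdlog = 0.5)     # g(x_n | x*) / g(x* | x_n)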

The code below creates a Metropolis-Hastings function that allows you to sample from a bivariate distribution where the proposal distribution \(g\) is the bivariate normal distribution.

# METROPOLIS-HASTINGS ALGORITHM

#' The Metropolis-Hastings algorithm with a bivariate normal proposal distribution.
#' @param f Target distribution (log scale).
#' @param iter Number of iterations to run the algorithm through.
#' @param chains Number of chains.
#' @param init Initial values for each chain.
#' @param prop_scale Value for tuning parameter.
#' @return An array of iterations by parameters by chains.
metropolis_hastings <- function(f, iter = 1e3, chains = 4, init = NULL, prop_scale = 1) {
  out <- array(NA, c(iter, 2, chains))
  # setup initial values
  for(p in 1:2) {
    if(is.null(init))
      out[1,p,] <- rnorm(chains, 0, 1)
    else
      out[1,p,] <- init[,p]
  }

  prop_covmat <- diag(1,2)

  # create sampler
  sampler <- function(x) {
    alpha <- 0
    u <- 1
    reject <- -1
    while(alpha < u) {
      reject <- reject + 1
      u <- runif(1, 0, 1)
      x_star <- rmvnorm(1, x, prop_covmat*prop_scale)
      # acceptance ratio with the Hastings correction (which cancels here since the normal proposal is symmetric)
      alpha <- exp((f(x_star)-f(x)) +
                     (dmvnorm(x, x_star, prop_covmat, log = TRUE) - dmvnorm(x_star, x, prop_covmat, log = TRUE)))
    }
    return(list("post" = x_star, "reject" = reject))
  }

  # run sampler for each chain
  for(j in 1:chains) {
    cat(":: chain", j, "::","")
    reject <- 0
    for(i in 2:iter) {
      sampler_stuff <- sampler(out[i-1,,j])
      reject <- reject + sampler_stuff$reject
      out[i,,j] <- sampler_stuff$post  # important! the rest is just print aesthetics
      if(i%%(iter/10) == 0)
        cat(".")
      if(i == iter)
        cat("", "accept rate =", iter,"/", reject + iter, "=", round(iter/(reject+iter),4), "\n")

    }
  }

  out
}

The code below creates the data and runs the Metropolis-Hastings algorithm. The posterior distribution we are interested in sampling from corresponds to the model, \[ y \sim \mathcal{MN}(\vec{\mu}, \mathbf{\Sigma}) \\ \mu_1 \sim \mathcal{N}(0, 3) \\ \mu_2 \sim \mathcal{N}(0, 3) \\ \]

where the covariance matrix \(\Sigma\) is known and is the \(2 \times 2\) identity matrix.

library(mvtnorm)

# parameters and data
covmat <- diag(1,2)
N <- 15
priors <- cbind(rnorm(N, 0, 3), rnorm(N, 0, 3))
dat <- array(NA, c(N, 2))
for(i in 1:N) {
  dat[i,] <- rmvnorm(1, mean = priors[i,], sigma = covmat)
}

# initial values for algorithm
init_val <- rbind(c(1,1),
                  c(-1,-1),
                  c(1,-1),
                  c(-1,1))
init_val <- init_val * 4

# posterior distribution
posterior_func <- function(pars) {
  lik <- sum(dmvnorm(dat, pars, sigma = covmat, log = TRUE))
  prior <- dnorm(pars, 0, 3, log = TRUE)
  lik + sum(prior)
}

# run algorithm
metro_hasti <- metropolis_hastings(f = posterior_func,
                                   iter = 2e3, chains = 4, init = init_val,
                                   prop_scale = 0.005)
## :: chain 1 :: .......... accept rate = 2000 / 2357 = 0.8485 
## :: chain 2 :: .......... accept rate = 2000 / 2389 = 0.8372 
## :: chain 3 :: .......... accept rate = 2000 / 2375 = 0.8421 
## :: chain 4 :: .......... accept rate = 2000 / 2381 = 0.84

Below we have the means of each parameter using the data and the sampled values. The posterior means are pulled toward the prior mean of zero, landing slightly closer to the true parameter values than the maximum likelihood estimates (the column means of the data).

##                     mu1       mu2
## mean data      1.465844 0.4108625
## mean posterior 1.380427 0.3835569

Now we examine the convergence of the chains. The plot below looks at convergence in the two dimensional space. (We deliberately used starting values in each corner of a square for illustrative purposes.) Eventually the chains reach the part of the posterior distribution that has the most mass and mix there.

The figure below illustrates the traceplots for each chain with the warmup shaded in grey.

Finally, the figure below illustrates the samples for each parameter using histograms.

Our marginal posterior distributions look appropriate given the data generating process, and it looks like our chains converged and mixed well.

Tuning

The Metropolis and Metropolis-Hastings algorithms are sensitive to tuning. Above we used a fairly inefficient tuning parameter prop_scale = 0.005 (this is apparent in the large number of accepted proposals). It is inefficient in the sense that it over-accepted proposals out in the tails. (In this example we know that the tails are not a particularly interesting part of the distribution, since our likelihood is multivariate normal and our priors are normal.) This isn’t a huge problem if we run a large number of iterations, but that is not always feasible. Below we rerun the model with a more efficient value for the tuning parameter.

metro_hasti_tuned <- metropolis_hastings(f = posterior_func,
                                   iter = 2e3, chains = 4, init = init_val,
                                   prop_scale = 0.1)
## :: chain 1 :: .......... accept rate = 2000 / 4276 = 0.4677 
## :: chain 2 :: .......... accept rate = 2000 / 4331 = 0.4618 
## :: chain 3 :: .......... accept rate = 2000 / 4205 = 0.4756 
## :: chain 4 :: .......... accept rate = 2000 / 4101 = 0.4877

Plotting the first 100 iterations, notice that the chains with the inefficient tuning parameter have not yet had a chance to mix in the center mass of the distribution, while with better tuning a lot of mixing is already taking place. Remember, when sampling from an unknown distribution we want the algorithm to get to the interesting part of the distribution and explore it.

Hamiltonian Monte Carlo

The Hamiltonian Monte Carlo (HMC) algorithm borrows concepts from physics to explore the target distribution more efficiently, suppressing the random walk behavior of Metropolis-Hastings. First, we need to introduce some concepts.

In order to do HMC we need the gradient of the posterior distribution with respect to the parameters. If we have a function that takes several inputs then the gradient is the vector of first order partial derivatives. Formally, say we have a function that takes \(J\) parameters and outputs a scalar value, \(f: \mathbb{R}^J \rightarrow \mathbb{R}\). Then the gradient of \(f\) is \(\triangledown f = \bigg(\frac{\partial f}{\partial\theta_1},\frac{\partial f}{\partial\theta_2},\ldots,\frac{\partial f}{\partial\theta_J}\bigg)\).
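The hmc() function below approximates this gradient numerically with central differences; in isolation, the idea looks like the following sketch (the quadratic f_test is invented so the answer is known):

# CENTRAL-DIFFERENCE GRADIENT APPROXIMATION
f_test <- function(theta) -sum(theta^2)  # known gradient: -2 * theta
e <- 1e-4
theta <- c(1, 2)
grad <- sapply(1:length(theta), function(j) {
  hi <- lo <- theta
  hi[j] <- theta[j] + e
  lo[j] <- theta[j] - e
  (f_test(hi) - f_test(lo)) / (2 * e)
})
round(grad)  # approximately (-2, -4)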

For each unknown parameter \(\theta_j\) we also need a momentum variable \(\phi_j\). Typically \(\vec{\phi}\) has a multivariate normal distribution with mean zero and mass matrix (i.e. covariance matrix) \(M\). Often \(M\) is a diagonal matrix so that the elements of \(\vec{\phi}\) are independent of each other. Tuning the algorithm involves a scalar step size \(\varepsilon\) that the user can initialize. We say initialize since the algorithm can be set up so that the tuning parameter adapts itself as the chain explores the target distribution.

ALGORITHM (Hamiltonian Monte Carlo)

  1. (Repeat 1-4 for each iteration.) Draw \(\vec{\phi} \sim \mathcal{MN}(\vec{0}, M)\).
  2. Repeat (a)-(c) \(L\) times, each pass completing one leapfrog step:
    a. Perform a half update of the momentum vector \(\vec{\phi} \leftarrow \vec{\phi} + \frac{1}{2}\varepsilon\triangledown f(\vec{\theta})\).
    b. Update the parameter vector \(\vec{\theta} \leftarrow \vec{\theta} + \varepsilon M^{-1}\vec{\phi}\).
    c. Complete the update of the momentum vector \(\vec{\phi} \leftarrow \vec{\phi} + \frac{1}{2}\varepsilon \triangledown f(\vec{\theta})\), using the gradient at the updated \(\vec{\theta}\).
  3. Compute the acceptance ratio \(\alpha = \frac{f(\vec{\theta}^*|y)\,p(\vec{\phi}^*)}{f(\vec{\theta}_{n-1}|y)\,p(\vec{\phi}_{n-1})}\).
  4. If \(u < \alpha\) then \(\vec{\theta}_n = \vec{\theta}^*\), else \(\vec{\theta}_n = \vec{\theta}_{n-1}\), where \(u \sim Unif(0,1)\).

The code below creates a function that runs the HMC algorithm, with the momentum vector drawn from a bivariate normal distribution.

# HAMILTONIAN MONTE CARLO

#' The Hamiltonian Monte Carlo algorithm for a two-parameter posterior.
#' @param f Target distribution (log scale).
#' @param iter Number of iterations to run the algorithm through.
#' @param chains Number of chains.
#' @param init_val Initial values for each chain.
#' @param epsilon_init Initial value for the step size tuning parameter.
#' @param M Symmetric mass matrix for the momentum distribution.
#' @param L_0 Baseline number of leapfrog steps (the realized number is jittered around this).
#' @return An array of iterations by parameters by chains.
hmc <- function(f, iter = 1e3, chains = 4, init_val = NULL, epsilon_init = 0.1, M = diag(1,2), L_0 = 10) {
  out <- array(NA, c(iter, 2, chains))
  phi <- array(NA, c(iter, 2, chains))
  if(is.null(init_val))
    out[1,,] <- t(rmvnorm(chains, c(0,0), diag(1,2)))
  else
    out[1,,] <- init_val
  phi[1,,] <- t(rmvnorm(chains, c(0,0), M))

  gradient <- function(theta) {
    e <- 1e-4
    out <- array(NA, length(theta))
    for(t in 1:length(theta)) {
      theta_hi <- theta_lo <- theta
      theta_hi[t] <- theta[t] + e
      theta_lo[t] <- theta[t] - e
      out[t] <- (f(theta_hi) - f(theta_lo)) / (2 * e)  # central difference
    }
    out
  }

  sampler <- function(theta) {
    phi <- c(rmvnorm(1, mean = c(0,0), sigma = M))
    # jitter the step size and the number of leapfrog steps
    epsilon <- runif(1, 0, 2 * epsilon_init)
    L <- ceiling(2 * L_0 * runif(1,0,1))
    for (l in 1:L) {
      phi <- phi + 0.5 * epsilon * gradient(theta)    # half-step update of the momentum
      theta <- c(theta + epsilon * solve(M) %*% phi)  # full-step update of the parameters
      phi <- phi + 0.5 * epsilon * gradient(theta)    # half-step update at the new theta
    }
    return(list("theta" = c(theta), "phi" = c(phi)))
  }

  accept_reject <- function(theta_star, theta, phi_star, phi) {
    dens_theta_star <- f(theta_star)
    dens_theta <- f(theta)
    dens_phi_star <- dmvnorm(phi_star, c(0,0), M, log = TRUE)
    dens_phi <- dmvnorm(phi, c(0,0), M, log = TRUE)
    alpha <- exp((dens_theta_star + dens_phi_star) - (dens_theta + dens_phi))
    alpha
  }

  cat(paste0("['.' = ", iter/10, " iterations]\n"))

  for(j in 1:chains) {
    cat(paste0("[ chain ", j, " ]"))
    for(i in 2:iter) {
      alpha <- 0
      u <- 1
      while(alpha < u) {
        u <- runif(1, 0, 1)
        sample <- sampler(out[i-1,,j])
        alpha <- accept_reject(sample$theta, out[i-1,,j], sample$phi, phi[i-1,,j])
      }
      phi[i,,j] <- sample$phi
      out[i,,j] <- sample$theta
      if(i%%(iter/10) == 0)
        cat(".")
      if(i == iter)
        cat("[ complete ]\n")
    }
  }
  out
}

The code below sets up the parameters, creates the data, and specifies the posterior distribution function we want to sample from.

# parameters
covmat <- rbind(c(1,0.9),c(0.9,1))
N <- 1e3
priors <- cbind(rnorm(N, 0, 1), rnorm(N, 0, 1))
# priors <- cbind(rep(0,N), rep(0,N))  # fixed
# data
dat <- array(NA, c(N, 2))
for(i in 1:N) {
  dat[i,] <- rmvnorm(1, mean = priors[i,], sigma = covmat)
}
# posterior distribution
post_func <- function(theta) {
  lik <- sum(dmvnorm(dat, theta, covmat, log = TRUE))
  prior <- dnorm(theta, 0, 0.1, log = TRUE)
  return(lik + sum(prior))
}

We are modeling the data as being generated from the bivariate normal distribution with known covariance matrix \(\Sigma\) and independent normal priors on the mean parameters. Formally, \[ y \sim \mathcal{MN}(\vec{\theta}, \Sigma) \\ \theta_1 \sim \mathcal{N}(0, 0.1) \\ \theta_2 \sim \mathcal{N}(0, 0.1) \\ \]

# setup initial conditions
init_val <- cbind(c(1,1),
                  c(-1,-1),
                  c(1,-1),
                  c(-1,1))
init_val <- init_val * 4

# run HMC
hmc_sim <- hmc(f = post_func, iter = 1e3, init_val = init_val, epsilon_init = ((1/15)^2), M = diag(1,2), L_0 = 10)
## ['.' = 100 iterations]
## [ chain 1 ]..........[ complete ]
## [ chain 2 ]..........[ complete ]
## [ chain 3 ]..........[ complete ]
## [ chain 4 ]..........[ complete ]

We can compare the means from the data and from the parameters sampled using HMC.

mean_table <- rbind("mean data" = colMeans(dat), "mean posterior" = rowMeans(colMeans(hmc_sim)))
colnames(mean_table) <- c("mu1", "mu2")
mean_table
##                        mu1         mu2
## mean data      -0.01779002 -0.01073626
## mean posterior -0.03208769 -0.01607497

We had a lot of data and a fairly simple model, so it makes sense that (what would be) the maximum likelihood estimate is close to the posterior mean of each location parameter.

The figure below shows the traceplots for both parameters.

The figure below illustrates the convergence of the chains in two dimensions. A contour plot of the actual posterior distribution is overlaid. Notice that all of the mixing is taking place in the center mass of the distribution.

References

Blitzstein, J. K. and Hwang, J. (2014) Introduction to Probability. Chapman & Hall/CRC Press, New York.

Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., and Rubin, D. B. (2013) Bayesian Data Analysis, 3rd Edition. Chapman & Hall/CRC Press, London.

Robert, C. P. and Casella, G. (2010) Introducing Monte Carlo Methods with R. Springer, New York.