Introduction

The purpose of this note is two-fold. First, we want to empirically explore how prior distributions influence posterior distributions. Second, we want to explore the trade-off between model complexity and big data. Most techniques taught in statistics deal with simple models and small data, while the machine learning community tends to deal with simple models and large amounts of data. Bayesian approaches are particularly useful for capturing uncertainty in the parameters when data are sparse and models are complex. With implementations like Stan, Bayesian methods are heading in a direction that will allow efficient computation on big data without losing the transparency of directly specifying a complex model.


One Observation and One Parameter

This space is not really interesting. Pedagogically it may be useful, but even then it is dry. However, we’ll start here since its counterpart, fitting complex models on big data, is interesting and a contemporary challenge.

Note that beliefs can be dichotomized into beliefs about the model and beliefs about the parameters. That is to say, we have assumptions about the data generating process, which include the likelihood of the data and the distributions of the parameters. In this section we deliberately separate these steps; however, every Bayesian model can be broken down into such components.


Belief about the data/model

Assume you have a single observation \(y\) that you believe has been generated from the binomial distribution according to some unknown probability parameter \(\theta\) and some known sample size \(n\). For example, you have data on \(n\) (Bernoulli) experiments and information on how many of the \(n\) experiments resulted in a success, denoted \(y\). In addition to knowing that \(\theta\) is bounded on the closed unit interval [0,1], you might also have some prior knowledge as to the distribution that \(\theta\) comes from. This might arise from domain-specific knowledge and might encourage you to believe that \(\theta\) is close to (or far from) some set of values on the closed unit interval (i.e. that some event is more or less likely to occur).

We can specify our belief about the model and prior information using Bayes’ theorem: \[ p(\theta | y) = \frac{f(y | \theta)g(\theta)}{\int_{\Omega}f(y | \theta)g(\theta)\ d\theta} \]

Given that our single observation comes from the binomial distribution, the likelihood of our data \(f(y | \theta)\) is simply the binomial probability mass function for \(y\) given \(\theta\). Formally,

\[ f(y|\theta) = \binom{n}{y}\theta^{y}(1-\theta)^{n-y} \]

As previously discussed, frequentists will tend to stop here and find the maximum likelihood estimate for \(\theta\) (i.e. the value of \(\theta\) that maximizes the (joint) probability of observing the data). For the binomial likelihood this is simply \(\hat{\theta} = y/n\).

Belief about the parameter

Our domain specific knowledge might lead us to consider various distributional specifications for our prior distribution \(g(\theta)\). Here we will consider the beta, uniform, normal, and Cauchy distributions as prior distributions for \(\theta\): \[ \begin{aligned} g_b(\theta | \alpha, \beta) &= \frac{\theta^{\alpha - 1}(1-\theta)^{\beta - 1}}{B(\alpha,\beta)} \\ g_u(\theta) &= 1 \\ g_n(\theta | \mu, \sigma) &= \left[\sigma\sqrt{2\pi}\right]^{-1}e^{- \frac{(\theta - \mu)^2}{2\sigma^2}} \\ g_c(\theta | x_0, \gamma) &= \left\{\pi\gamma\left[1+\left(\frac{\theta - x_0}{\gamma}\right)^2\right]\right\}^{-1} \end{aligned} \]

Our posterior distribution of \(\theta\) will adjust depending on the functional form placed on the parameter \(\theta\). Why? Because we are probabilistically weighting values of \(\theta\) in addition to determining how likely those values are to have generated the data. If we are only interested in the value of \(\theta\) that is most likely to have generated the data, then we are saying nothing about which values of \(\theta\) are more likely a priori. However, by defining \(\theta\) as a random variable that comes from a probability distribution, we can make probabilistic statements about which values of \(\theta\) are more likely to occur. This means that we end up with a probability distribution that describes \(\theta\) both before and after observing the data (the prior and posterior, respectively).

Sampling from the grid

Sampling from the grid (or grid approximation) can be thought of as a “brute force” way to estimate your posterior distribution. It is not the best way to obtain a sample from the posterior but its simplicity allows the components of the posterior to be easily identified in an applied setting. Its name derives from the procedure of using a “grid” of candidate parameter values (e.g. a vector or a matrix) to determine your posterior probabilities associated with each candidate parameter or candidate \(k\)-tuple of parameters (for models with \(k\) parameters). We can then use these posterior probabilities to sample from the grid of parameter values (with replacement). This will give us a distribution associated with the parameter.

In order to sample from the grid we need to specify a function that can compute the posterior probability associated with each candidate value (or at least the probability up to a normalizing constant). We also need a function to sample from the grid of candidate parameter values using the posterior probability associated with each value. The sample() function in R can be used to accomplish this. However, in this section we do not sample, but rather plot the posterior distribution associated with each candidate parameter value. In other words, we are looking at the distribution that the posterior samples will converge to as the number of samples approaches infinity. Posterior values will be sampled from the grid starting with the next section.
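The sampling step itself is a one-liner; a minimal sketch, assuming a hypothetical vector grid of candidate values and a matching vector post of posterior probabilities:

theta_draws <- sample(grid, size = 10000, replace = TRUE, prob = post)  # weighted draws of theta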

Below is the R code to calculate the posterior probabilities of each value of \(\theta\) using beta, uniform, normal, and Cauchy prior distributions, respectively. The first three lines set up the grid and the data.

prob <- seq(0, 1, by=0.01)  # grid of candidate parameter values (theta)
x <- 5                      # number of successes
n <- 10                     # number of trials

# beta prior
binom_beta <- function(x, n, theta, alpha, beta) {
  lik <- dbinom(x, n, theta)
  prior <- dbeta(theta, alpha, beta)
  post <- (lik * prior) / sum(lik * prior)
  return(list('lik' = lik, 'prior' = prior, 'post' = post))
}
# uniform prior
binom_unif <- function(x, n, theta, alpha, beta) {
  lik <- dbinom(x, n, theta)
  prior <- dunif(theta, alpha, beta)
  post <- (lik * prior) / sum(lik * prior)
  return(list('lik' = lik, 'prior' = prior, 'post' = post))
}
# normal prior
binom_norm <- function(x, n, theta, loc, scale) {
  lik <- dbinom(x, n, theta)
  prior <- dnorm(theta, loc, scale)
  post <- (lik * prior) / sum(lik * prior)
  return(list('lik' = lik, 'prior' = prior, 'post' = post))
}
# cauchy prior
binom_cauchy <- function(x, n, theta, loc, scale) {
  lik <- dbinom(x, n, theta)
  prior <- dcauchy(theta, loc, scale)
  post <- (lik * prior) / sum(lik * prior)
  return(list('lik' = lik, 'prior' = prior, 'post' = post))
}

The models below run through the binom_beta() function which uses the \(Beta(\alpha,\beta_i)\) prior distribution with \(\alpha = 2\) and increasing values for \(\beta_i\in[2,11]\). We then plot the distribution of the likelihood of the data, the distribution of the prior on \(\theta\), and the posterior distribution of \(\theta\). The figure below illustrates the relationship between the prior and the posterior as the Beta prior belief on probability shifts towards zero.
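The plotting code itself is not shown; a minimal sketch of the driving loop, with hypothetical base-graphics calls standing in for the actual figures, might look like this:

for (b in 2:11) {
  fit <- binom_beta(x, n, prob, alpha = 2, beta = b)
  plot(prob, fit$post, type = "l")                    # posterior over the grid
  lines(prob, fit$prior / sum(fit$prior), lty = 2)    # prior, normalized over the grid
}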

The figure below shows the relationship between the prior and the posterior as the Beta prior belief on probability shifts towards one. In this case the prior uses \(\alpha_i\in[2,11]\) and \(\beta = 2\).

Using a \(Unif(0,1)\) prior does not change the shape of the posterior relative to the likelihood (i.e. no prior information is being encoded in the model, since the uniform probability density function is constant and equal to one for all values in its support).
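We can check this equivalence directly; with a flat prior the normalized posterior is just the normalized likelihood (a quick check, not part of the original code):

fit_u <- binom_unif(x, n, prob, 0, 1)
all.equal(fit_u$post, fit_u$lik / sum(fit_u$lik))  # TRUE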

The figure below shows the relationship between the prior and the posterior when using a truncated \(\mathcal{N}(\mu_i,\sigma)\) prior with \(\mu_i\in[0,0.8]\) and \(\sigma = 1\). This is a less “extreme” approach compared to the beta prior distribution.

The figure below shows the relationship between the prior and the posterior when using a truncated \(\mathcal{N}(\mu,\sigma_i)\) prior with \(\mu = 0\) and \(\sigma_i\in[0.2,1]\). Notice that shrinking the scale parameter greatly affects the shape of the prior, and thus the posterior, distribution. In this case more weight is placed on values of \(\theta\) closer to zero as the standard deviation shrinks towards 0.

The figure below shows the relationship between the prior and the posterior when using a truncated \(Cauchy(x_{0_i},\gamma)\) prior with \(x_{0_i}\in[0,0.8]\) and \(\gamma = 1\). We cannot readily see the main difference between the Cauchy distribution and the normal distribution since \(\theta\) is bounded on the unit interval. However, the merit of using a Cauchy distribution as a prior is that it puts more weight in the tails, compared to the normal distribution.

The figure below shows the relationship between the prior and the posterior when using a truncated \(Cauchy(x_0,\gamma_i)\) prior with \(x_{0} = 0\) and \(\gamma_i\in[0.2,1]\). Similar to the scale parameter in the normal distribution, as we reduce the scale parameter we tighten the distribution around its location (in this case zero). In contrast to the normal distribution, the visible tail of the Cauchy distribution has more mass under it, indicating that more weight will be placed on values of \(\theta\) closer to one than with the corresponding location-scale pair in the normal distribution.
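That extra tail mass is easy to verify numerically (a quick comparison, not part of the original code):

dcauchy(0.9, location = 0, scale = 0.2) / dnorm(0.9, mean = 0, sd = 0.2)  # ~940: far more Cauchy mass in the tail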

One Observation and Multiple Parameters

Complex models with small data are really interesting since you can readily see the superiority of the Bayesian approach. When you are fitting complex models to sparse data it’s particularly useful (and sometimes essential) to probabilistically specify the space in which you search for your parameters. Casting a wide (uniform) net will certainly pick up some value, but if you know where your parameter values are not (or, better yet, where they are), then it makes sense to confine the search accordingly. If you don’t, you can end up with incorrect inferences.

Our data consist of a single draw from the univariate normal distribution with an unknown standard deviation and a mean that is itself normally distributed. The observation is approximately -2.7, quite far from the value the mean would converge to under a large sample (i.e. zero). Below we fit the data according to the following model, \[ \begin{aligned} y &\sim \mathcal{N}(\mu,\sigma) \\ \mu &\sim \mathcal{N}(0,1) \\ \sigma &\sim \mathcal{N}(0,1) \end{aligned} \]

remove(list = ls())
y <- -2.69833  # this value was generated with rnorm(1, rnorm(1, 0, 5), 1)
mu_grid <- seq(-5, 5, by=0.01)
sd_grid <- seq(0.01, 10, by=0.01)

post_norm <- function(y, mu_grid, sd_grid) {
  lik_fun <- function(mu, sd) {               # likelihood of y for each (mu, sd) pair
    dnorm(y, mu, sd)
  }
  prior_mu <- dnorm(mu_grid, 0, 1)            # prior on mu
  prior_sd <- dnorm(sd_grid, 0, 1)            # prior on sd (truncated to sd > 0 via the grid)
  prior <- outer(prior_mu, prior_sd)          # joint prior over the grid
  lik <- outer(mu_grid, sd_grid, lik_fun)     # likelihood evaluated over the grid
  post <- (lik * prior) / (sum(lik * prior))  # normalized posterior probability grid
  return(list("lik" = lik, "prior" = prior, "post" = post))
}

# evaluate full posterior grid
post_full <- post_norm(y, mu_grid, sd_grid)
# prob dist of mu
post_mu <- rowSums(post_full$post)
# prob dist of sd
post_sd <- colSums(post_full$post)

# bayesian estimate of mu and sd
b_est <- NULL
b_est[1] <- mean(sample(mu_grid, 10000, replace = TRUE, prob = post_mu))
b_est[2] <- mean(sample(sd_grid, 10000, replace = TRUE, prob = post_sd))

# mle of mu and sd (maximizing the log-density is numerically safer and has the same argmax)
mle <- optim(c(0, 1), function(par) dnorm(y, par[1], par[2], log = TRUE),
             control = list(fnscale = -1), method = "L-BFGS-B",
             lower = c(-10, 0.01), upper = c(10, 10))$par
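A quick side-by-side comparison of the two estimates (a hypothetical print; the note compares them in plots instead):

round(rbind(bayes = b_est, mle = mle), 3)  # columns: mu, then sd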

The plots below illustrate the likelihood, prior, and posterior. We also compare the posterior estimate with the maximum likelihood estimate. The estimates of the location parameter are particularly interesting. When applying a prior distribution that places more weight on the appropriate subset of the support we get a value closer to the true value of \(\mu\). The MLE relies solely on the data, and with a single observation the likelihood is maximized at the observation itself, so the estimate of \(\mu\) lands far from zero.

Multiple Observations and One Parameter

This space is challenging, particularly because even basic functions/models become difficult to fit on big data if they’re not scaled appropriately. In the context of big data and simple models, there is enough information in the data to allow frequentist methods to find an appropriate estimate (conditional on the data). However, this might not be the right estimate, since the beliefs about the model (i.e. the specification of the likelihood of the data) might be oversimplified.


The code below fits the following model, \[ \begin{aligned} y &\sim \mathcal{N}(\mu, 1) \\ \mu &\sim \mathcal{N}(0, 0.1) \end{aligned} \]

remove(list = ls())
N <- 25
prior_mu <- rnorm(N, 0, 5)
y <- rnorm(N, prior_mu, 2)
mu_grid <- seq(-2, 2, by=0.01)
# assume standard deviation is known.

post_norm <- function(y, mu_grid, sd) {
  lik_fun <- function(mu, sd) {                 # joint likelihood of y at each candidate mu
    vapply(mu, function(m) prod(dnorm(y, m, sd)), numeric(1))
  }
  lik <- lik_fun(mu_grid, sd)
  prior <- dnorm(mu_grid, 0, 0.1)               # prior on mu
  post <- (lik * prior) / sum(lik * prior)      # normalized posterior probability grid
  return(list("lik" = lik, "prior" = prior, "post" = post))
}

# evaluate posterior distribution of mu
post_full <- post_norm(y, mu_grid, sd = 1)
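With N = 25 the product of densities is harmless, but with genuinely big data it underflows to zero. A common remedy, sketched below (not part of the original code), is to compute everything on the log scale and subtract the maximum before exponentiating:

post_norm_log <- function(y, mu_grid, sd) {
  loglik <- vapply(mu_grid, function(m) sum(dnorm(y, m, sd, log = TRUE)), numeric(1))
  logprior <- dnorm(mu_grid, 0, 0.1, log = TRUE)
  logpost <- loglik + logprior
  logpost <- logpost - max(logpost)        # guard against underflow
  exp(logpost) / sum(exp(logpost))         # normalized posterior probabilities
}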

Multiple Observations and Multiple Parameters

This is the most interesting space. Here we have a complex model that defines the data generating process and the beliefs about the parameter distributions, and we have a larger set of observations to fit the model to. The model here, while not exciting, has been scaled up relative to the previous models we have been working with.


The code below fits the following model,

\[ \begin{aligned} y_i &\sim \mathcal{N}(\mu,\sigma)\ \ \ \ \ \forall\ i \in N \\ \mu &\sim \mathcal{N}(0,1) \\ \sigma &\sim \mathcal{N}(0,1) \end{aligned} \]

remove(list = ls())
N <- 20
prior_mu <- rnorm(N, 0, 5)
prior_sd <- NULL
for(i in 1:N) {
  prior_sd[i] <- rnorm(1, 0, 1)
  while(prior_sd[i]<=0)
    prior_sd[i] <- rnorm(1, 0, 1)
}
y <- rnorm(N, prior_mu, prior_sd)
mu_grid <- seq(-5, 5, by=0.01)
sd_grid <- seq(0.01, 10, by=0.01)  # start above zero to avoid a degenerate sd

post_norm <- function(y, mu_grid, sd_grid) {
  lik_fun <- function(mu, sd) {                 # likelihood function
    prod(dnorm(y, mu, sd))
  }
  cat('... evaluating likelihood\n')
  lik <- outer(mu_grid, sd_grid, Vectorize(lik_fun))
  prior_mu <- dnorm(mu_grid, 0, 1)               # prior on mu
  prior_sd <- dnorm(sd_grid, 0, 1)               # prior on sd
  prior <- outer(prior_mu, prior_sd)
  post <- (lik * prior) / (sum(lik * prior))     # posterior probability grid
  return(list("lik" = lik, "prior" = prior, "post" = post))
}

# evaluate the full posterior and the marginal posteriors
post_full <- post_norm(y, mu_grid, sd_grid)
post_mu <- rowSums(post_full$post)
post_sd <- colSums(post_full$post)
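As in the single-observation case, point estimates can be recovered by sampling from the grid with the marginal posterior weights (a sketch mirroring the earlier b_est code, which the original does not repeat here):

b_est <- NULL
b_est[1] <- mean(sample(mu_grid, 10000, replace = TRUE, prob = post_mu))  # posterior mean of mu
b_est[2] <- mean(sample(sd_grid, 10000, replace = TRUE, prob = post_sd))  # posterior mean of sd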

As before, the likelihood, prior, and posterior are plotted, along with the maximum likelihood estimate and the posterior estimate for \(\mu\) and \(\sigma\).