Introduction

This note goes over writing basic Stan files and fitting the models in R using RStan. We will look at how to go from simple full/no pooling models to simple hierarchical models. The models originally come from Daniel Lee’s slides, which are floating around the interweb. We will use the eight schools data set, which is declared below,

y <- c(28, 8, -3, 7, -1, 1, 18, 12)        # treatment effect
sigma <- c(15, 10, 16, 11, 9, 11, 10, 18)  # standard error of the treatment effect

The data describes the treatment effect of coaching in standardized testing in eight high schools, along with the associated standard error.

Anatomy of a Stan File

There are seven blocks that you can define in your Stan file: functions, data, transformed data, parameters, transformed parameters, model, and generated quantities (in that order).

For now we will focus only on the data, parameters, and model blocks. Blocks are defined with the name of the block followed by curly brackets. The Stan files that we will use in this note have the following structure:

data {
// ...
}
parameters {
// ...
}
model {
// ...
}

All Stan files need to end with the extension .stan.

In order to pass your Stan file to the stan() function you need to either 1) specify the file’s path on your machine or 2) set your working directory to the folder in which the file is located. (You can use the getwd() function and the setwd("specify/path/here") function to determine the current working directory or to set the working directory to "specify/path/here", respectively.)

Some popular variable types that you can use in Stan are int, real, vector[N], row_vector[N], and matrix[N, M]. You can also declare arrays of these types; for example, real y[N] declares an array of N reals.

To be clear, we should define the bounds on the variables we are declaring. If N defines one dimension of the data then it should be declared as int<lower=0> N;. This says that N is an integer and the smallest value it can take is 0, which is reasonable since negative dimensions are nonsensical. You can also specify bounds on parameters. This is particularly useful for scale parameters, which have a lower bound of 0, or probability parameters, which are bounded between 0 and 1. For example, a probability parameter p should be declared as real<lower=0,upper=1> p;.

Various Model Specifications

Full pooling and no pooling models tend to be inappropriate unless you can confidently claim that the generative model really is a full/no pooling model. Typically, it’s better to settle on a model somewhere between these two extremes.

Full Pooling

Here we do not consider the idiosyncrasies of each school and have one scalar parameter theta to describe the effect of coaching. We assume that the likelihood of the data is normal. Mathematically our model is,

\[ y_i \sim \mathcal{N}(\theta, \sigma_i)\ \ \ i = 1,...,8 \] where \(\theta\) is an unknown (scalar) parameter and \(\sigma_i\) is known.
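As a quick sanity check on the fit below: under a flat prior on \(\theta\), the posterior mean of the full pooling model has a closed form, the precision-weighted (inverse-variance) average of the observations. A minimal sketch in R,

```r
# Precision-weighted average of the treatment effects. Under a flat prior
# on theta, this is the closed-form posterior mean of the full pooling model.
y <- c(28, 8, -3, 7, -1, 1, 18, 12)
sigma <- c(15, 10, 16, 11, 9, 11, 10, 18)
w <- 1 / sigma^2                  # precision of each observation
theta_hat <- sum(w * y) / sum(w)  # weighted mean, roughly 7.7
```

This lands close to the posterior mean that Stan reports for theta below, which is a useful check that the model is coded correctly.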

The 8schools_0.stan file is defined below,

// Full Pooling
data {
  int<lower=1> N;            // dimensions of data
  real y[N];                 // treatment effect
  real<lower=0> sigma[N];    // se of treatment effect
}
parameters {
  real theta;                // scalar parameter
}
model {
  // likelihood
  target += normal_lpdf(y | theta, sigma);
}

The data needs to be passed into the stan() function as an R list(). The names of all the elements of the list need to match the names declared in the data{...} block of the Stan file. Additionally, the type of variable you’re passing from R needs to match the type defined in the Stan file (i.e. if you’ve declared v to be a vector in Stan then you need to make sure that v is actually a vector in your data list). In this case we need to define a list that contains an integer named N, an N-dimensional vector called y, and an N-dimensional vector called sigma. We construct this below,

dat <- list(y = y,
            sigma = sigma)
dat$N <- length(dat$y)

names(dat)
## [1] "y"     "sigma" "N"
print(dat)
## $y
## [1] 28  8 -3  7 -1  1 18 12
## 
## $sigma
## [1] 15 10 16 11  9 11 10 18
## 
## $N
## [1] 8

Now we can call the stan model and data into stan(). For speed we reduce the number of iterations to 500 (compared to the default of 2000).

Note, if any change has been made to the Stan file (including when you first write your Stan file) it will need to be recompiled. You might notice that there is no output for a while when you run stan(). This is because your Stan file is being translated/compiled into C++ code and, depending on how complex your model is, this can take some time.

library(rstan)
rstan_options(auto_write = TRUE)             # compiled model saved in same directory as stan file
options(mc.cores = parallel::detectCores())  # run parallel::detectCores() chains in parallel
fit0 <- stan("/Users/imad/desktop/stan/bsss/models/8schools_0.stan", data = dat,
             iter = 500, chains = 4)

Make sure that you always check the diagnostics of your model (\(\hat{R}\) (Rhat), effective sample size (n_eff), traceplots, etc.). Run your model several times to ensure that the diagnostics you’re seeing are consistent.
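One way to pull these diagnostics out programmatically is through RStan’s summary() method on the fitted object, a sketch (assuming the fit0 object from above),

```r
# summary() on a stanfit object returns a list whose $summary element is a
# matrix with one row per parameter and columns including n_eff and Rhat.
fit_summary <- summary(fit0)$summary
fit_summary[, c("n_eff", "Rhat")]

# A crude automated check: flag the fit if any Rhat strays far from 1.
if (any(fit_summary[, "Rhat"] > 1.1)) {
  warning("Some Rhat values exceed 1.1 -- the chains may not have mixed.")
}
```

The 1.1 threshold here is a common rule of thumb, not a hard cutoff; at convergence Rhat should be very close to 1.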

retro <- c(rgb(255, 102, 136, alpha = 150, max = 255),
           rgb(187, 102, 136, alpha = 150, max = 255),
           rgb(119, 102, 136, alpha = 150, max = 255),
           rgb(51, 102, 136, alpha = 150, max = 255))
print(fit0, digits = 2)
## Inference for Stan model: 8schools_0.
## 4 chains, each with iter=500; warmup=250; thin=1; 
## post-warmup draws per chain=250, total post-warmup draws=1000.
## 
##         mean se_mean   sd   2.5%    25%    50%    75%  97.5% n_eff Rhat
## theta   7.50    0.16 3.96  -0.07   4.74   7.41  10.30  15.74   600 1.00
## lp__  -30.15    0.03 0.65 -32.00 -30.30 -29.91 -29.72 -29.67   347 1.01
## 
## Samples were drawn using NUTS(diag_e) at Wed Apr 12 19:29:14 2017.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).
traceplot(fit0) + scale_color_manual(values = retro)
## Scale for 'colour' is already present. Adding another scale for
## 'colour', which will replace the existing scale.

No Pooling

Now we go to the other end of the spectrum and model each school separately, such that the information used to model each school is not shared amongst schools. Our model looks like,

\[ \begin{aligned} y_1 &\sim \mathcal{N}(\theta_1, \sigma_1) \\ y_2 &\sim \mathcal{N}(\theta_2, \sigma_2)\\ &\vdots\\ y_8 &\sim \mathcal{N}(\theta_8, \sigma_8)\\ \end{aligned} \] The model is similar to the full pooling model except now we assume that each outcome \(y_i\) has its own location parameter \(\theta_i\).

In our Stan file we model the parameters as the vector theta. The 8schools_1.stan file is defined below,

// No Pooling
data {
  int<lower=1> N;            // dimensions of data
  real y[N];                 // treatment effect
  real<lower=0> sigma[N];    // se of treatment effect
}
parameters {
  real theta[N];             // array of parameters
}
model {
  // likelihood
  target += normal_lpdf(y | theta, sigma);
}
fit1 <- stan("/Users/imad/desktop/stan/bsss/models/8schools_1.stan", data = dat,
             iter = 500, chains = 4)
print(fit1, digits = 2)
## Inference for Stan model: 8schools_1.
## 4 chains, each with iter=500; warmup=250; thin=1; 
## post-warmup draws per chain=250, total post-warmup draws=1000.
## 
##            mean se_mean    sd   2.5%    25%    50%    75%  97.5% n_eff
## theta[1]  28.09    0.50 15.94  -4.32  16.98  28.28  39.22  58.98  1000
## theta[2]   7.91    0.33 10.56 -13.02   0.79   7.85  15.41  27.71  1000
## theta[3]  -2.61    0.48 15.08 -31.32 -13.09  -2.32   7.62  26.39  1000
## theta[4]   7.21    0.36 11.40 -15.86  -0.53   7.38  14.93  29.09  1000
## theta[5]  -1.33    0.30  9.36 -19.68  -7.80  -1.39   4.73  18.61  1000
## theta[6]   0.92    0.34 10.88 -19.64  -6.48   1.05   8.26  22.51  1000
## theta[7]  17.93    0.31  9.86  -1.11  11.29  18.23  24.65  37.06  1000
## theta[8]  11.70    0.57 18.11 -24.72  -0.04  11.10  23.58  48.32  1000
## lp__     -31.44    0.10  2.14 -36.53 -32.68 -30.98 -29.86 -28.58   467
##          Rhat
## theta[1] 1.00
## theta[2] 1.00
## theta[3] 1.00
## theta[4] 1.00
## theta[5] 1.00
## theta[6] 1.00
## theta[7] 1.00
## theta[8] 1.00
## lp__     1.01
## 
## Samples were drawn using NUTS(diag_e) at Wed Apr 12 19:29:49 2017.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).
traceplot(fit1) + scale_color_manual(values = retro)
## Scale for 'colour' is already present. Adding another scale for
## 'colour', which will replace the existing scale.

Partial Pooling

Now our model gets more interesting. We settle on some middle ground between full pooling and no pooling. We do this by assuming there exists some population-level distribution (or global distribution) that all the \(\theta_i\)’s come from. This distribution controls the degree of pooling. Mathematically the model looks like,

\[ \begin{aligned} y_1 &\sim \mathcal{N}(\theta_1, \sigma_1) &\mbox{likelihood}\\ y_2 &\sim \mathcal{N}(\theta_2, \sigma_2)\\ &\vdots\\ y_8 &\sim \mathcal{N}(\theta_8, \sigma_8)\\ \boldsymbol{\theta} &\sim \mathcal{N}(\mu, 5) &\mbox{prior}\\ \mu &\sim \mathcal{N}(0,1) &\mbox{hyperprior} \\ \end{aligned} \]

Here, the population-level location parameter \(\mu\) controls the degree of pooling among the \(\theta_i\)s. We call this a partially pooled model because we have fixed the population-level scale parameter to equal 5, rather than estimating it by way of the posterior distribution.

The 8schools_2.stan file is defined below,

// Partial Pooling
data {
  int<lower=1> N;            // dimensions of data
  real y[N];                 // treatment effect
  real<lower=0> sigma[N];    // se of treatment effect
}
parameters {
  real theta[N];             // array of parameters
  real mu;                   // hyperparameter (location)
}
model {
  // likelihood
  target += normal_lpdf(y | theta, sigma);
  // prior
  target += normal_lpdf(theta| mu, 5);
  // hyperprior
  target += normal_lpdf(mu | 0, 1);
}
fit2 <- stan("/Users/imad/desktop/stan/bsss/models/8schools_2.stan", data = dat,
             iter = 500, chains = 4)

Here, mu controls the degree of pooling for the individual theta values.
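You can see the pooling directly by comparing the per-school posterior means from the no pooling fit to those from this partially pooled fit, a sketch (assuming the fit1 and fit2 objects from above),

```r
# extract() returns the posterior draws; $theta is a draws x schools matrix.
theta_nopool  <- colMeans(rstan::extract(fit1, pars = "theta")$theta)
theta_partial <- colMeans(rstan::extract(fit2, pars = "theta")$theta)

# The partially pooled estimates are pulled toward the population mean mu;
# schools with large standard errors are shrunk the most.
round(cbind(no_pooling = theta_nopool, partial = theta_partial), 2)
```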

print(fit2, digits = 2)
## Inference for Stan model: 8schools_2.
## 4 chains, each with iter=500; warmup=250; thin=1; 
## post-warmup draws per chain=250, total post-warmup draws=1000.
## 
##            mean se_mean   sd   2.5%    25%    50%    75%  97.5% n_eff Rhat
## theta[1]   3.23    0.16 5.05  -6.60  -0.17   3.24   6.69  13.32  1000    1
## theta[2]   1.83    0.14 4.34  -7.04  -1.18   1.85   4.90  10.01  1000    1
## theta[3]   0.16    0.15 4.81  -8.97  -3.02   0.04   3.41   9.42  1000    1
## theta[4]   1.35    0.15 4.61  -7.74  -1.70   1.24   4.54  10.48  1000    1
## theta[5]   0.08    0.14 4.57  -8.85  -2.85  -0.01   2.94   9.28  1000    1
## theta[6]   0.48    0.14 4.52  -8.13  -2.69   0.52   3.46   9.49  1000    1
## theta[7]   3.86    0.14 4.52  -5.04   0.78   3.76   6.92  12.70  1000    1
## theta[8]   1.10    0.16 5.01  -8.82  -2.29   1.25   4.46  10.67  1000    1
## mu         0.36    0.03 0.96  -1.56  -0.30   0.37   1.02   2.19  1000    1
## lp__     -56.42    0.09 2.07 -61.11 -57.71 -56.19 -54.84 -53.25   508    1
## 
## Samples were drawn using NUTS(diag_e) at Wed Apr 12 19:30:28 2017.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).
traceplot(fit2) + scale_color_manual(values = retro)
## Scale for 'colour' is already present. Adding another scale for
## 'colour', which will replace the existing scale.

Hierarchical

Now we go full hierarchical. We do this by treating all of the parameters in the population-level distribution as unknown. Mathematically our model is,

\[ \begin{aligned} y_1 &\sim \mathcal{N}(\theta_1, \sigma_1) &\mbox{likelihood}\\ y_2 &\sim \mathcal{N}(\theta_2, \sigma_2)\\ &\vdots\\ y_8 &\sim \mathcal{N}(\theta_8, \sigma_8)\\ \boldsymbol{\theta} &\sim \mathcal{N}(\mu, \tau) &\mbox{prior} \\ \mu &\sim \mathcal{N}(0,1) &\mbox{hyperprior} \\ \tau &\sim \mathcal{N}(0,1) &\mbox{hyperprior} \\ \end{aligned} \]

Now we are no longer treating the scale parameter \(\tau\) in the population-level distribution as fixed.

The 8schools_3.stan file is defined below,

// Hierarchical
data {
  int<lower=1> N;            // dimensions of data
  real y[N];                 // treatment effect
  real<lower=0> sigma[N];    // se of treatment effect
}
parameters {
  real theta[N];             // array of parameters
  real mu;                   // hyperparameter (location)
  real<lower=0> tau;         // hyperparameter (scale)
}
model {
  // likelihood
  target += normal_lpdf(y | theta, sigma);
  // prior
  target += normal_lpdf(theta| mu, tau);
  // hyperprior
  target += normal_lpdf(mu | 0, 1);
  target += normal_lpdf(tau | 0, 1);
}
fit3 <- stan("/Users/imad/desktop/stan/bsss/models/8schools_3.stan", data = dat,
             iter = 500, chains = 4)

The variables mu and tau control the degree of pooling of the individual theta values for each school.
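Since tau is now estimated, it is worth looking at its posterior directly; small values of tau imply strong pooling. A sketch (assuming the fit3 object from above),

```r
# Posterior draws of the population-level scale parameter tau.
tau_draws <- rstan::extract(fit3, pars = "tau")$tau

# Small tau pulls the theta_i tightly toward mu (strong pooling); large tau
# lets each school's estimate stand mostly on its own.
quantile(tau_draws, probs = c(0.025, 0.25, 0.5, 0.75, 0.975))
```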

print(fit3, digits = 2)
## Inference for Stan model: 8schools_3.
## 4 chains, each with iter=500; warmup=250; thin=1; 
## post-warmup draws per chain=250, total post-warmup draws=1000.
## 
##            mean se_mean   sd   2.5%    25%    50%    75%  97.5% n_eff Rhat
## theta[1]   3.28    0.17 5.31  -7.01  -0.34   3.28   6.81  13.33  1000 1.00
## theta[2]   1.85    0.13 4.20  -6.63  -1.10   1.78   4.86  10.18  1000 1.00
## theta[3]   0.05    0.14 4.54  -8.40  -3.26   0.25   3.26   8.56  1000 1.00
## theta[4]   1.65    0.14 4.38  -6.82  -1.32   1.71   4.62  10.19  1000 1.00
## theta[5]   0.11    0.13 4.17  -8.18  -2.66   0.12   2.79   8.29  1000 1.00
## theta[6]   0.44    0.15 4.62  -8.24  -2.84   0.73   3.56   9.07  1000 1.00
## theta[7]   3.94    0.15 4.68  -4.92   0.62   3.81   7.09  13.08  1000 1.00
## theta[8]   1.15    0.15 4.87  -8.07  -2.27   1.12   4.36  10.30  1000 1.00
## mu         0.35    0.03 0.94  -1.45  -0.30   0.37   0.99   2.18  1000 1.00
## tau        0.81    0.02 0.61   0.04   0.34   0.67   1.13   2.27  1000 1.00
## lp__     -58.32    0.11 2.20 -63.43 -59.65 -57.94 -56.69 -55.10   387 1.01
## 
## Samples were drawn using NUTS(diag_e) at Wed Apr 12 19:31:05 2017.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).
traceplot(fit3) + scale_color_manual(values = retro)
## Scale for 'colour' is already present. Adding another scale for
## 'colour', which will replace the existing scale.

Hierarchical (Unstable)

In the model above we assigned hyperpriors to the hyperparameters mu and tau. This is important: if we don’t “guide the search” for these parameters then it becomes difficult for the sampler to explore the posterior distribution and for the parameter values to converge appropriately. Typically, in high-dimensional problems with sparse data it is a good idea to assign informative priors to hyperparameters.

The 8schools_4.stan file is defined below,

// Hierarchical (Unstable)
data {
  int<lower=1> N;            // dimensions of data
  real y[N];                 // treatment effect
  real<lower=0> sigma[N];    // se of treatment effect
}
parameters {
  real theta[N];             // array of parameters
  real mu;                   // hyperparameter (location)
  real<lower=0> tau;         // hyperparameter (scale)
}
model {
  // likelihood
  target += normal_lpdf(y | theta, sigma);
  // prior
  target += normal_lpdf(theta| mu, tau);
  // hyperprior
  // target += normal_lpdf(mu | 0, 1);
  // target += normal_lpdf(tau | 0, 1);
}
fit4 <- stan("/Users/imad/desktop/stan/bsss/models/8schools_4.stan", data = dat,
             iter = 500, chains = 4)
## Warning: There were 958 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See
## http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
## Warning: Examine the pairs() plot to diagnose sampling problems

The issues with fitting this sort of model are obvious from the diagnostics: tau wanders off to absurdly large values, several Rhat values sit well above 1, and the effective sample sizes are tiny. Visually, we see poor mixing among chains.
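Beyond the warning message, you can count the post-warmup divergent transitions yourself with RStan’s get_sampler_params(), a sketch (assuming the fit4 object from above),

```r
# get_sampler_params() returns one matrix per chain; the "divergent__"
# column is 1 for iterations that ended in a divergent transition.
sp <- rstan::get_sampler_params(fit4, inc_warmup = FALSE)
n_divergent <- sum(sapply(sp, function(chain) sum(chain[, "divergent__"])))
n_divergent  # should match the count reported in the warning above
```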

print(fit4, digits = 2)
## Inference for Stan model: 8schools_4.
## 4 chains, each with iter=500; warmup=250; thin=1; 
## post-warmup draws per chain=250, total post-warmup draws=1000.
## 
##                  mean se_mean   sd           2.5%           25%
## theta[1]  8.48000e+00    0.97 6.37  -1.690000e+00  3.620000e+00
## theta[2]  6.72000e+00    0.92 5.44  -4.490000e+00  3.190000e+00
## theta[3]  7.13000e+00    1.20 6.47  -6.080000e+00  2.430000e+00
## theta[4]  7.56000e+00    1.14 6.18  -5.420000e+00  3.620000e+00
## theta[5]  5.63000e+00    1.22 5.93  -5.020000e+00  1.190000e+00
## theta[6]  5.86000e+00    0.96 5.49  -5.540000e+00  2.860000e+00
## theta[7]  9.38000e+00    0.93 4.95  -1.900000e-01  5.830000e+00
## theta[8]  6.84000e+00    1.05 5.87  -5.100000e+00  3.010000e+00
## mu        7.30000e+00    0.98 4.40  -1.960000e+00  4.340000e+00
## tau      8.47646e+307     Inf  Inf  3.815995e+306 3.853036e+307
## lp__      6.54710e+02    0.16 2.14   6.496400e+02  6.535600e+02
##                    50%           75%         97.5% n_eff Rhat
## theta[1]  8.170000e+00  1.243000e+01  2.260000e+01    43 1.08
## theta[2]  6.630000e+00  1.045000e+01  1.728000e+01    35 1.08
## theta[3]  7.590000e+00  1.211000e+01  1.908000e+01    29 1.08
## theta[4]  7.590000e+00  1.156000e+01  1.934000e+01    29 1.12
## theta[5]  5.510000e+00  1.043000e+01  1.607000e+01    23 1.21
## theta[6]  5.950000e+00  9.630000e+00  1.587000e+01    33 1.09
## theta[7]  9.370000e+00  1.279000e+01  1.925000e+01    29 1.10
## theta[8]  6.870000e+00  1.069000e+01  1.813000e+01    31 1.09
## mu        7.410000e+00  1.081000e+01  1.469000e+01    20 1.18
## tau      8.277068e+307 1.317799e+308 1.700932e+308  1000  NaN
## lp__      6.550500e+02  6.561600e+02  6.580500e+02   175 1.01
## 
## Samples were drawn using NUTS(diag_e) at Wed Apr 12 19:31:42 2017.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).
traceplot(fit4) + scale_color_manual(values = retro)
## Scale for 'colour' is already present. Adding another scale for
## 'colour', which will replace the existing scale.