## Introduction

This note goes over writing basic Stan files and fitting the model in R using RStan. We will look at how to go from simple full/no pooling models to simple hierarchical models. The models originally come from Daniel Lee’s slides which are floating around the interweb. We will use the eight schools data set which is declared below,

y <- c(28, 8, -3, 7, -1, 1, 18, 12)        # treatment effect
sigma <- c(15, 10, 16, 11, 9, 11, 10, 18)  # standard error of the treatment effect

The data describes the treatment effect of coaching in standardized testing in eight high schools, along with the associated standard error.

## Anatomy of a Stan File

There are seven blocks that you can define in your Stan file:

• functions {...} for user defined functions.
• data {...} to define the data used to fit the model.
• transformed data {...} for any transformations you want to conduct on the data.
• parameters {...} to define the parameters (i.e. unknowns) in your model (integers are currently not supported).
• transformed parameters {...} for specifying any transformations on your parameters.
• model {...} to define your posterior distribution (i.e. the likelihood of the data and the prior distributions of the parameters).
• generated quantities {...} to generate useful quantities such as posterior predictions, point-wise log likelihood, etc.

For now we will focus only on the data, parameters, and model blocks. Blocks are defined with the name of the block followed by curly brackets. The Stan files that we will use in this note have the following structure:

data {
// ...
}
parameters {
// ...
}
model {
// ...
}


All Stan files need to end with the extension .stan.

In order to call your stan file into the stan() function you need to either 1) specify the file’s path on you machine or 2) set your working directory to the folder in which the file is located. (You can use the getwd() function and the setwd("specify/path/here") function to determine the current working directory or to set the working directory to "specify/path/here", respectively.)

Some popular variable types that you can use in Stan are,

• int which defines an array of integers.
• real which defines an array of reals.
• vector which defines a vector of reals.
• matrix which defines a matrix of reals.

To be clear we should define the bounds on the variables we are declaring. If N defines one dimension of the data then it should be declared as int<lower0> N;. This says that N is an integer and the smallest value it can take is 0. This is reasonable since negative dimensions are nonsensical. You can specify bounds on parameters. This is particularly useful for scale parameters which have a lower bound of 0, or probability parameters which have a lower bound of 0 and an upper bound of 1. for example a probability parameter p should be declared as real<lower=0,upper=1> p;.

## Various Model Specifications

Full pooling and no pooling models tend to be inappropriate unless you can confidently claim that the generative model is that of a full/no pooling model. Typically, it’s better to settle on a model somewhere within these extremes.

### Full Pooling

Here we do not consider the idiosyncracies of each school and have one scalar parameter theta to describe the effect of coaching. We assume that the likelihood of the data is normal. Mathematically our model is,

$y_i \sim \mathcal{N}(\theta, \sigma_i)\ \ \ i = 1,...,8$ where $$\theta$$ is a unknown (scalar) parameter and $$\sigma_i$$ is known.

The 8schools_0.stan file is defined below,

// Full Pooling
data {
int<lower=1> N;            // dimensions of data
real y[N];                 // treatment effect
real<lower=0> sigma[N];    // se of treatment effect
}
parameters {
real theta;                // scalar parameter
}
model {
// likelihood
target += normal_lpdf(y | theta, sigma);
}


The data needs to be passed into the stan() function as an R list() The names of all the elements of the list need to match the names declared in the data{...} block of the Stan file. Additionally, the type of variable you’re passing from R needs to match the type defined in the Stan file (i.e. if you’ve declared v to be a vector in Stan then you need to make sure that v is actually a vector in your data list). In this case we need to define a list that contains an integer named N, an N-dimensional vector called y, and an N-dimensional vector called sigma. We construct this below,

dat <- list(y = y,
sigma = sigma)
dat$N <- length(dat$y)

names(dat)
## [1] "y"     "sigma" "N"
print(dat)
## $y ## [1] 28 8 -3 7 -1 1 18 12 ## ##$sigma
## [1] 15 10 16 11  9 11 10 18
##
## \$N
## [1] 8

Now we can call the stan model and data into stan(). For speed we reduce the number of iterations to 500 (compared to the default of 2000).

Note, if any change has been made to the Stan file (including when you first write your Stan file) it will need to be recomplied. You might notice that there is no output for a while when you run stan(). This is because your Stan file is being translated/complied into C++ code and, depending on how complex your model is, this can take some time.

library(rstan)
rstan_options(auto_write = TRUE)             # complied model saved in same directory as stan file
options(mc.cores = parallel::detectCores())  # run parallel::detectCores() chains in parallel
fit0 <- stan("/Users/imad/desktop/stan/bsss/models/8schools_0.stan", data = dat,
iter = 500, chains = 4)

Make sure that you always check the diagnostics of your model ($$\hat{R}$$ (Rhat), effective sample size (n_eff), traceplots, etc.). Run your model several times to ensure that the diagnostics you’re seeing are consistent.

retro <- c(rgb(255, 102, 136, alpha = 150, max = 255),
rgb(187, 102, 136, alpha = 150, max = 255),
rgb(119, 102, 136, alpha = 150, max = 255),
rgb(51, 102, 136, alpha = 150, max = 255))
print(fit0, digits = 2)
## Inference for Stan model: 8schools_0.
## 4 chains, each with iter=500; warmup=250; thin=1;
## post-warmup draws per chain=250, total post-warmup draws=1000.
##
##         mean se_mean   sd   2.5%    25%    50%    75%  97.5% n_eff Rhat
## theta   7.50    0.16 3.96  -0.07   4.74   7.41  10.30  15.74   600 1.00
## lp__  -30.15    0.03 0.65 -32.00 -30.30 -29.91 -29.72 -29.67   347 1.01
##
## Samples were drawn using NUTS(diag_e) at Wed Apr 12 19:29:14 2017.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
traceplot(fit0) + scale_color_manual(values = retro)
## Scale for 'colour' is already present. Adding another scale for
## 'colour', which will replace the existing scale.