Introduction

This note covers how to extract samples from an RStan fit object and some situations in which you might want to use the generated quantities block in your Stan file. Specifically, we cover generating posterior predictions and evaluating the point-wise posterior log likelihood of an autoregressive process with one time lag, known as AR(1). AR(p) processes are limited in their ability to model time series, and models such as vector autoregression and Gaussian processes tend to be much better alternatives (both of which can be fit in Stan).

AR(1) Time Series Model

The motivation for an AR(1) model is the belief that your outcome variable at time \(t\) can be explained by its value at time \(t-1\). For example, we may believe that the monthly unemployment rate of a country depends on the previous month's rate.

Mathematically we define an AR(1) process as, \[ \begin{align} y_t \sim \mathcal{N}(\alpha + \rho y_{t-1}, \sigma) \end{align} \] where \(\alpha\), \(\rho\), and \(\sigma\) are parameters. To make sure the series is stationary (i.e. well behaved and not explosive) we can bound the parameter on the lag term such that \(\rho \in [-1,1]\) (strictly speaking, stationarity requires \(|\rho| < 1\)).

This is a generative model so we can simulate the data and see if we can recover the parameters by fitting the model in Stan.

# AR(1) data generating process
set.seed(123)           # fix the seed for reproducibility
alpha <- 3
rho <- 0.6
dat <- list(N = 200)
dat$y <- numeric(dat$N) # pre-allocate the series
dat$y[1] <- 1
for (n in 2:dat$N)
  dat$y[n] <- alpha + rho * dat$y[n-1] + rnorm(1, 0, 1)  # error sd (sigma) is 1
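
As a sanity check, the stationary mean of this process is \(\alpha / (1 - \rho) = 3 / 0.4 = 7.5\), so the simulated series should fluctuate around 7.5.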

Our Stan model will look something like,

// AR(1) Model
data {
  int<lower=0> N;   // number of observations
  real y[N];        // time series data
}
parameters {
  real<lower=-1,upper=1> rho;   // parameter on lag term
  real alpha;                   // constant term
  real<lower=0> sigma;          // sd of error
}
model {
  // likelihood
  for (n in 2:N)
    target+= normal_lpdf(y[n] | alpha + rho * y[n-1], sigma);
  // priors
  target+= normal_lpdf(rho | 0 , 0.5);
  target+= normal_lpdf(alpha | 0 , 1);
}

Notice that we’ve put bounds on the data and parameters using angle brackets <...>.

library(rstan)
rstan_options(auto_write = TRUE)             # compiled model saved in same directory as stan file
options(mc.cores = parallel::detectCores())  # run parallel::detectCores() chains in parallel
# fit the model
fit0 <- stan("models/AR1.stan", data = dat,
             iter = 500, chains = 4)

We can take a look at the diagnostics by printing fit0 to the console and examining the effective sample sizes and \(\hat{R}\). We can also look at the traceplots to visually determine whether the chains have mixed well.

# semi-transparent palette, one color per chain
retro <- c(rgb(255, 102, 136, alpha = 150, maxColorValue = 255),
           rgb(187, 102, 136, alpha = 150, maxColorValue = 255),
           rgb(119, 102, 136, alpha = 150, maxColorValue = 255),
           rgb(51, 102, 136, alpha = 150, maxColorValue = 255))
print(fit0, digits = 2)
## Inference for Stan model: AR1.
## 4 chains, each with iter=500; warmup=250; thin=1; 
## post-warmup draws per chain=250, total post-warmup draws=1000.
## 
##          mean se_mean   sd    2.5%     25%     50%     75%   97.5% n_eff
## rho      0.64    0.00 0.05    0.55    0.61    0.64    0.67    0.74   392
## alpha    2.65    0.02 0.36    1.91    2.44    2.66    2.90    3.34   400
## sigma    0.93    0.00 0.05    0.84    0.89    0.93    0.96    1.02   528
## lp__  -273.32    0.07 1.31 -276.57 -273.97 -272.95 -272.33 -271.84   402
##       Rhat
## rho      1
## alpha    1
## sigma    1
## lp__     1
## 
## Samples were drawn using NUTS(diag_e) at Wed Apr 12 19:48:58 2017.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).
traceplot(fit0) + scale_color_manual(values = retro)
## Scale for 'colour' is already present. Adding another scale for
## 'colour', which will replace the existing scale.

We can extract the samples from fit0 as a matrix as follows,

post <- as.matrix(fit0)
head(post)
##           parameters
## iterations       rho    alpha     sigma      lp__
##       [1,] 0.6129643 2.880150 0.9324324 -271.8917
##       [2,] 0.6976333 2.191782 1.0122295 -274.3695
##       [3,] 0.6036338 2.889651 0.8815505 -272.5073
##       [4,] 0.6594617 2.485426 0.9259884 -272.0639
##       [5,] 0.6460159 2.636972 0.8975701 -271.8743
##       [6,] 0.6518459 2.563735 0.9139478 -271.8362
dim(post)
## [1] 1000    4
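
Alternatively, extract() returns the draws as a named list of arrays, with the chains merged and the draws permuted,

post_list <- extract(fit0)   # named list with elements rho, alpha, sigma, lp__
length(post_list$rho)        # 1000 permuted draws of rho
mean(post_list$rho)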

Notice that the number of rows of post is equal to (iter - warmup) * chains = 1000 posterior samples. lp__ is the log of the total probability density/mass function up to some normalizing constant. We can use these samples to plot the distribution (i.e. histogram) of, and the credible interval associated with, each parameter. In the plots below the blue lines denote the 50% credible interval.

par(mfrow = c(1,3), oma = c(0,0,2,0))
# distribution for alpha
hist(post[,"alpha"], breaks = 50, col = "#808080", border = FALSE,
     main = expression(paste("Distribution of ", alpha)),
     xlab = expression(alpha))
alpha50 <- quantile(post[,"alpha"], probs = c(0.25, 0.75), names = FALSE)
abline(v = alpha50, lty = 2, lwd = 2, col = "#40D2FE")
# distribution for rho
hist(post[,"rho"], breaks = 50, col = "#808080", border = FALSE,
     main = expression(paste("Distribution of ", rho)),
     xlab = expression(rho))
rho50 <- quantile(post[,"rho"], probs = c(0.25, 0.75), names = FALSE)
abline(v = rho50, lty = 2, lwd = 2, col = "#40D2FE")
# distribution for sigma
hist(post[,"sigma"], breaks = 50, col = "#808080", border = FALSE,
     main = expression(paste("Distribution of ", sigma)),
     xlab = expression(sigma))
sigma50 <- quantile(post[,"sigma"], probs = c(0.25, 0.75), names = FALSE)
abline(v = sigma50, lty = 2, lwd = 2, col = "#40D2FE")
title("Marginal Posterior Distributions of Parameters", outer = TRUE, cex.main = 1.5)

Posterior Predictions

Using the generated quantities {...} block we can generate posterior predictions to check whether our model appropriately fits the data. Since we modeled the data using the normal distribution we can generate predictions with the normal_rng() function. The Stan model that implements this is provided below.

// AR(1) Model with posterior predictions
data {
  int<lower=0> N;   // number of observations
  real y[N];        // time series data
}
parameters {
  real<lower=-1,upper=1> rho;   // parameter on lag term
  real alpha;                   // constant term
  real<lower=0> sigma;          // sd of error
}
model {
  // likelihood
  for (n in 2:N)
    target+= normal_lpdf(y[n] | alpha + rho * y[n-1], sigma);
  // priors
  target+= normal_lpdf(rho | 0 , 0.5);
  target+= normal_lpdf(alpha | 0 , 1);
}
generated quantities {
  vector[N-1] y_rep;
  // posterior predictions (recall: obs n predicts obs n+1)
  for (n in 1:(N-1))
    y_rep[n] = normal_rng(alpha + rho * y[n], sigma);
}

As before, we fit the model and inspect the traceplots and summary.

fit1 <- stan("models/AR1_pp.stan", data = dat,
             iter = 500, chains = 4)
traceplot(fit1, pars = c("alpha", "rho", "sigma")) + scale_color_manual(values = retro)
## Scale for 'colour' is already present. Adding another scale for
## 'colour', which will replace the existing scale.

print(fit1, pars = c("alpha", "rho", "sigma"))
## Inference for Stan model: AR1_pp.
## 4 chains, each with iter=500; warmup=250; thin=1; 
## post-warmup draws per chain=250, total post-warmup draws=1000.
## 
##       mean se_mean   sd 2.5%  25%  50%  75% 97.5% n_eff Rhat
## alpha 2.67    0.02 0.37 1.95 2.42 2.67 2.93  3.41   296 1.01
## rho   0.64    0.00 0.05 0.54 0.60 0.64 0.68  0.73   295 1.01
## sigma 0.93    0.00 0.05 0.84 0.89 0.92 0.96  1.03   386 1.01
## 
## Samples were drawn using NUTS(diag_e) at Wed Apr 12 19:49:02 2017.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

We can extract the matrix of posterior samples as mentioned above. Notice that we now have a column for each element of y_rep; essentially we have a matrix of y_rep values whose dimension is the number of posterior samples by the length of y_rep. This is because for each sampled parameter value we generate one prediction per time point.

# extract the samples as a matrix
post <- as.matrix(fit1)
colnames(post)[1:100]
##   [1] "rho"       "alpha"     "sigma"     "y_rep[1]"  "y_rep[2]" 
##   [6] "y_rep[3]"  "y_rep[4]"  "y_rep[5]"  "y_rep[6]"  "y_rep[7]" 
##  [11] "y_rep[8]"  "y_rep[9]"  "y_rep[10]" "y_rep[11]" "y_rep[12]"
##  [16] "y_rep[13]" "y_rep[14]" "y_rep[15]" "y_rep[16]" "y_rep[17]"
##  [21] "y_rep[18]" "y_rep[19]" "y_rep[20]" "y_rep[21]" "y_rep[22]"
##  [26] "y_rep[23]" "y_rep[24]" "y_rep[25]" "y_rep[26]" "y_rep[27]"
##  [31] "y_rep[28]" "y_rep[29]" "y_rep[30]" "y_rep[31]" "y_rep[32]"
##  [36] "y_rep[33]" "y_rep[34]" "y_rep[35]" "y_rep[36]" "y_rep[37]"
##  [41] "y_rep[38]" "y_rep[39]" "y_rep[40]" "y_rep[41]" "y_rep[42]"
##  [46] "y_rep[43]" "y_rep[44]" "y_rep[45]" "y_rep[46]" "y_rep[47]"
##  [51] "y_rep[48]" "y_rep[49]" "y_rep[50]" "y_rep[51]" "y_rep[52]"
##  [56] "y_rep[53]" "y_rep[54]" "y_rep[55]" "y_rep[56]" "y_rep[57]"
##  [61] "y_rep[58]" "y_rep[59]" "y_rep[60]" "y_rep[61]" "y_rep[62]"
##  [66] "y_rep[63]" "y_rep[64]" "y_rep[65]" "y_rep[66]" "y_rep[67]"
##  [71] "y_rep[68]" "y_rep[69]" "y_rep[70]" "y_rep[71]" "y_rep[72]"
##  [76] "y_rep[73]" "y_rep[74]" "y_rep[75]" "y_rep[76]" "y_rep[77]"
##  [81] "y_rep[78]" "y_rep[79]" "y_rep[80]" "y_rep[81]" "y_rep[82]"
##  [86] "y_rep[83]" "y_rep[84]" "y_rep[85]" "y_rep[86]" "y_rep[87]"
##  [91] "y_rep[88]" "y_rep[89]" "y_rep[90]" "y_rep[91]" "y_rep[92]"
##  [96] "y_rep[93]" "y_rep[94]" "y_rep[95]" "y_rep[96]" "y_rep[97]"
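
We can confirm the dimensions of the prediction matrix,

dim(post[, grep("y_rep", colnames(post))])   # 1000 draws by N-1 = 199 columns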

Below we plot the actual series along with the 50% and 90% credible intervals, and the column-wise mean of the posterior predictions.

# selector for y_rep columns
sel <- grep("y_rep", colnames(post))
# compute the credible intervals
ci50 <- matrix(NA, nrow = length(sel), ncol = 2)
ci90 <- matrix(NA, nrow = length(sel), ncol = 2)
for (i in seq_along(sel)) {
  ci50[i,] <- quantile(post[,sel[i]], probs = c(0.25, 0.75), names = FALSE)
  ci90[i,] <- quantile(post[,sel[i]], probs = c(0.05, 0.95), names = FALSE)
}
# plot the true series along with posterior predictions
plot(0, type = "n", xlim = c(0, length(dat$y)), ylim = range(post[, sel]),
     xlab = "time", ylab = "y", main = "AR(1) Process (simulated data)")
# plot credible intervals (predictions start at t = 2)
t <- 2:dat$N
polygon(c(rev(t), t), c(rev(ci90[,1]), ci90[,2]), col = "#FF668830", border = FALSE)
polygon(c(rev(t), t), c(rev(ci50[,1]), ci50[,2]), col = "#FF668880", border = FALSE)
# plot avg of predictions
lines(t, colMeans(post[,sel]), col = "#40D2FE", lwd = 2)
# plot true series
lines(dat$y, col = "#808080", lwd = 2)
legend("bottomright", c("series", "pred (avg)", "pred (50% CI)", "pred (90% CI)"),
       col = c("#808080", "#40D2FE", "#FF668880", "#FF668830"),
       lwd = 2)

Point-Wise Log Likelihood

Another useful quantity to generate is the point-wise log likelihood of the data evaluated at each posterior draw. Once we do this we can extract these values with the loo package (or manually) and evaluate the expected predictive accuracy, which can be used to compare models. Since we modeled the data using the normal distribution we can use the normal_lpdf() function to evaluate the log likelihood at each sampled parameter value. This is implemented in the Stan file below.
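
Concretely, for each posterior draw \(\theta^{(s)} = (\alpha^{(s)}, \rho^{(s)}, \sigma^{(s)})\) we evaluate, \[ \begin{align} \log p(y_{n+1} \mid y_n, \theta^{(s)}) = \log \mathcal{N}(y_{n+1} \mid \alpha^{(s)} + \rho^{(s)} y_n, \sigma^{(s)}) \end{align} \] which yields an \(S \times (N-1)\) matrix of log likelihood values, where \(S\) is the number of posterior draws.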

// AR(1) Model with posterior predictions and log likelihood
data {
  int<lower=0> N;   // number of observations
  real y[N];        // time series data
}
parameters {
  real<lower=-1,upper=1> rho;   // parameter on lag term
  real alpha;                   // constant term
  real<lower=0> sigma;          // sd of error
}
model {
  // likelihood
  for (n in 2:N)
    target+= normal_lpdf(y[n] | alpha + rho * y[n-1], sigma);
  // priors
  target+= normal_lpdf(rho | 0 , 0.5);
  target+= normal_lpdf(alpha | 0 , 1);
}
generated quantities {
  vector[N-1] y_rep;
  vector[N-1] log_lik;
  for (n in 1:(N-1)) {
    // posterior predictions
    y_rep[n] = normal_rng(alpha + rho * y[n], sigma);
    // log likelihood (recall: obs n predicts obs n+1)
    log_lik[n] = normal_lpdf(y[n+1] | alpha + rho * y[n], sigma);
  }
}

Now we need to fit the model and check the diagnostics.

fit2 <- stan("models/AR1_pp_ll.stan", data = dat,
             iter = 500, chains = 4)
print(fit2, pars = c("alpha", "rho", "sigma"))
## Inference for Stan model: AR1_pp_ll.
## 4 chains, each with iter=500; warmup=250; thin=1; 
## post-warmup draws per chain=250, total post-warmup draws=1000.
## 
##       mean se_mean   sd 2.5%  25%  50%  75% 97.5% n_eff Rhat
## alpha 2.70    0.02 0.34 2.05 2.46 2.69 2.92  3.42   300 1.01
## rho   0.64    0.00 0.05 0.54 0.61 0.64 0.67  0.72   316 1.00
## sigma 0.92    0.00 0.05 0.84 0.89 0.92 0.95  1.03   530 1.00
## 
## Samples were drawn using NUTS(diag_e) at Wed Apr 12 19:49:06 2017.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

Using the extract_log_lik() function in the loo package we can easily extract the point-wise log likelihood values and evaluate elpd_loo, the expected log point-wise predictive density (a measure of out-of-sample predictive accuracy).

library(loo)
ll_mat <- extract_log_lik(fit2, parameter_name = "log_lik")
loo(ll_mat[,-1])
## Computed from 1000 by 198 log-likelihood matrix
## 
##          Estimate  SE
## elpd_loo   -189.3 2.5
## p_loo         1.1 0.2
## looic       378.6 4.9
## 
## All Pareto k estimates are good (k < 0.5)
## See help('pareto-k-diagnostic') for details.
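
For intuition, the (non-cross-validated) log point-wise predictive density can also be computed by hand from the log likelihood matrix. This is a minimal sketch; unlike elpd_loo it applies no leave-one-out correction, so it will be slightly optimistic,

# average the likelihood over draws for each observation, then sum the logs
# (same columns as the loo() call above; on larger problems prefer a
# log-sum-exp for numerical stability)
lpd <- sum(log(colMeans(exp(ll_mat[,-1]))))
lpd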

Generalizing to AR(p)

The above models work reasonably well, but they are not written efficiently. Here are a few reasons why,

  1. The model is only useful for AR processes with one lag.
  2. Loops are being used in the model block. With large data sets and complex models we should vectorize variables and use linear algebra where we can for computational efficiency.
  3. We have hard-coded the values of the prior parameters when we could easily pass them in as data.

The Stan model below presents one way to implement an AR(p) model. We utilize the transformed data {...} block to transform the time series y based on its dimension (N) and the number of lags the user declares (P). (Note that all variables created within the transformed data block must be declared at the top of the block.)

// AR(p) model
data {
  int<lower=0> N;         // length of series
  int<lower=1,upper=N> P; // number of lags
  vector[N] y;            // time series
  real loc_alpha;         // prior loc on alpha
  real scale_alpha;       // prior scale on alpha
  vector[P] loc_rho;      // prior loc on rho
  vector[P] scale_rho;    // prior scale on rho
}
transformed data {
  // transform data to accommodate P-lag process
  vector[N-P] y_trans;    // outcome of series
  matrix[N-P, P] Y;       // matrix of (lagged) predictors
  for (n in 1:(N-P)) {
    y_trans[n] = y[n+P];
    for (p in 1:P)
      Y[n,p] = y[(P + n) - p];
  }
}
parameters {
  real alpha;                         // intercept
  vector<lower=-1,upper=1>[P] rho;    // vector of parameters
  real<lower=0> sigma;                // sd of error
}
model {
  // likelihood
  target+= normal_lpdf(y_trans | alpha + Y * rho, sigma);
  // priors
  target+= normal_lpdf(rho | loc_rho, scale_rho);
  target+= normal_lpdf(alpha | loc_alpha, scale_alpha);
  target+= normal_lpdf(sigma | 0, 1);
}
generated quantities {
  vector[N-P] y_rep;
  for (n in 1:(N-P))
    y_rep[n] = normal_rng(alpha + Y[n,] * rho, sigma);
}
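
To see what the transformed data block is doing, here is the same lagging operation sketched in R (purely illustrative; Stan performs it internally),

# build y_trans and the (N-P) x P lag matrix Y in R, with P = 3
N <- dat$N; P <- 3
y_trans <- dat$y[(P + 1):N]                               # outcomes y_{P+1}, ..., y_N
Y <- sapply(1:P, function(p) dat$y[(P + 1 - p):(N - p)])  # column p holds lag p
head(cbind(y_trans, Y))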

As mentioned above, for an AR(1) model stationarity can be induced by bounding the parameter on the lag term between -1 and 1. However, this is not sufficient for AR(p) models where p > 1 (see the condition below), so the AR(p) model defined above is not guaranteed to be stationary, and hence is not a proper generative model. Below we fit an AR(3) model on data that were generated by an AR(1) process.
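
For reference, an AR(p) process is stationary when all roots \(z\) of the lag polynomial, \[ \begin{align} 1 - \rho_1 z - \rho_2 z^2 - \cdots - \rho_p z^p = 0 \end{align} \] lie outside the unit circle, a condition that is awkward to encode directly as parameter bounds in Stan.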

# The data block now includes more elements and we need to pass those in
dat$P <- 3
dat$loc_alpha <- 0
dat$scale_alpha <- 1
dat$loc_rho <- rep(0, dat$P)
dat$scale_rho <- rep(0.1, dat$P)
# fit the AR(p) model
fit3 <- stan("models/ARp.stan", data = dat,
             iter = 500, chains = 4)
print(fit3, pars = c("alpha", "rho", "sigma"))
## Inference for Stan model: ARp.
## 4 chains, each with iter=500; warmup=250; thin=1; 
## post-warmup draws per chain=250, total post-warmup draws=1000.
## 
##        mean se_mean   sd  2.5%  25%  50%  75% 97.5% n_eff Rhat
## alpha  3.24    0.02 0.45  2.35 2.95 3.22 3.56  4.10   762 1.01
## rho[1] 0.43    0.00 0.06  0.31 0.39 0.44 0.48  0.55   675 1.00
## rho[2] 0.08    0.00 0.06 -0.04 0.03 0.08 0.12  0.20   625 1.01
## rho[3] 0.05    0.00 0.05 -0.05 0.01 0.05 0.09  0.15   649 1.00
## sigma  0.94    0.00 0.05  0.85 0.91 0.94 0.97  1.03   724 1.00
## 
## Samples were drawn using NUTS(diag_e) at Wed Apr 12 19:49:10 2017.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).