## Introduction

This is a brief example of how to use the rstanarm package in the context of logistic regression. See the rstanarm documentation (reference manual and vigenettes) for the most up to date material. We will also use the bayesplot package that provides additional plotting functionality that users can apply to rstan/rstanarm objects, and the loo package that provides a way to perform model selection.

Although we will focus on the stan_glm function to fit models it is useful to be aware of the other models that can be fit using rstanarm. Below we list the function and a very brief “story” about the data which identifies why you would use that particular function to model the data.

• stan_lm
• $$y \in \mathbb{R}$$.
• Useful for modeling a continuous outcome that is linear in terms of parmeters.
• stan_glm
• $$y \in \{0, n\}$$ where $$n \in \mathbb{N}$$.
• Useful for modeling bernoulli trials (family=binomial(link = "logit")).
• Useful for modeling count data (family="poission" or family="neg_binomial_2).
• stan_glmer
• $$y$$ can be continuous or discrete.
• Useful for modeling hierarchical structure.
• stan_polr
• $$y \in \{1, J\}$$ where $$J \in \mathbb{N}$$.
• Useful for mdoeling ordinal outcomes.
• stan_betareg
• $$y \in (0,1)$$.
• Useful for modeling rates/proportions/ratings.

## Data Generation Process

Below we generate data to be used in logistic regression. The identifying feature of data that can be fit using logistic regression is that the outcome variable is binary.

# the inverse logit function
#' @param x A number fro the real line.
#' @return A probability.
inv_logit <- function(x) {
return(exp(x)/(1+exp(x)))
}
# (not necessary) function to generate multinormal random number generation for correlated predictors
#' @param n An integer value indicating the sample size
#' @param mu A vector of means of length K
#' @param sigma A K-by-K covariance matrix
#' @return A N-by-K matrix of multivariate normal random variables.
generate_multinorm_data <- function(n, mu, sigma) {
if(any((eigen(sigma)$values) < 0)) stop("\none or more eigenvalues of 'sigma' are negative") x <- matrix(rnorm(n * length(mu)), ncol = length(mu), nrow = n) sigma_decomp <- svd(sigma) u <- sigma_decomp$u
d <- diag(sigma_decomp$d,length(sigma_decomp$d))
y <- t(mu + u %*% sqrt(d) %*% t(x))
return(y)
}
# function to generate logistic data
#' @param beta A vector of parameter values.
#' @param X A matrix of covariates.
#' @param const Indicate whether a constant should be included in the
#'              data generation process (defaults to TRUE).
#' @return A vector of binary data.
generate_logistic_data <- function(beta, X, cons = TRUE) {
n <- nrow(X)
X_orig <- X
if (cons) {
X <- cbind(rep(1,n),X)
}
prob <- inv_logit(X%*%beta)
y <- rbinom(n, 1, prob)

out <- data.frame(cbind(y,X_orig,prob))
names <- paste0("x", 1:ncol(X_orig))
colnames(out) <- c("y", names, "prob")

return(out)
}
# generate data
N <- 300
prior_mu <- c(-0.5, 0.5, 1.5)
prior_sd <- rep(0.8,length(prior_mu))
beta <- c(rnorm(1, prior_mu, prior_sd),
rnorm(1, prior_mu, prior_sd),
rnorm(1, prior_mu, prior_sd))
X <- generate_multinorm_data(N, beta[-1], matrix(c(1,0.5,0.5,1), ncol = 2))
dat_glm <- generate_logistic_data(beta, X, cons = TRUE)
# summarize dependent variable
table(dat_glm\$y)
##
##   0   1
## 110 190

Our model is defined as follows,

$y \sim Bin(1, \mathbf{p}) \\ \mbox{logit}(\mathbf{p}) = \mathbf{X}\boldsymbol{\beta} \\ \beta_0 \sim \mathcal{N}(-0.5, 0.5) \\ \beta_1 \sim \mathcal{N}(0.5,0.25) \\ \beta_2 \sim \mathcal{N}(1.0,0.25)$

Where $$\mbox{logit}(p) = \log\bigg(\frac{p}{1-p}\bigg)$$. Note that the logit function $$\log\bigg(\frac{p}{1-p}\bigg)$$ maps from the closed unit interval $$[0,1]$$ to the real line $$\mathbb{R}$$. So we can use the inverse of the logit function to map our linear predictor $$\mathbf{X}\boldsymbol{\beta}$$ to the closed unit interval and then characterize these values as probabilities which can be passed into the binomial distribution with size $$n=1$$.

## Model Fitting

library(rstanarm)
# fit model using mcmc
fit1 <- stan_glm(y ~ x1 + x2, data = dat_glm, family = binomial(link = "logit"), cores = 4,
prior_intercept = normal(prior_mu, prior_sd), prior = normal(prior_mu[-1], 0.5*prior_sd[-1]))
cbind("stan_glm" = coef(fit1),  # model fit with mcmc
"beta" = beta,            # the parameter realized from the prior distributions
"beta_mu" = prior_mu)     # the prior distribution location parameters
##               stan_glm       beta beta_mu
## (Intercept) -0.6613041 -0.9392810    -0.5
## x1          -0.6930683 -0.8349270     0.5
## x2           0.7392915  0.8497792     1.5

We can take a look at the priors actually used in the model with the prior_summary() function.

prior_summary(fit1)
## Priors for model 'fit1'
## ------
## Intercept (after predictors centered)
##  ~ normal(location = -0.5, scale = 0.8)
##
## Coefficients
##  ~ normal(location = [0.5,1.5], scale = [0.4,0.4])
## ------
## See help('prior_summary.stanreg') for more details

## Diagnostics

Diagnostics refers to ensuring that the sampler is performing appropriately. First lets look at the summary output available from the stanreg object.

summary(fit1, digits = 2)
##
## Model Info:
##
##  function:  stan_glm
##  family:    binomial [logit]
##  formula:   y ~ x1 + x2
##  algorithm: sampling
##  priors:    see help('prior_summary')
##  sample:    4000 (posterior sample size)
##  num obs:   300
##
## Estimates:
##                 mean    sd      2.5%    25%     50%     75%     97.5%
## (Intercept)     -0.67    0.24   -1.14   -0.82   -0.66   -0.51   -0.21
## x1              -0.69    0.14   -0.98   -0.79   -0.69   -0.60   -0.41
## x2               0.74    0.14    0.47    0.64    0.74    0.83    1.03
## mean_PPD         0.63    0.04    0.55    0.60    0.63    0.65    0.70
## log-posterior -192.55    1.26 -195.83 -193.10 -192.22 -191.64 -191.16
##
## Diagnostics:
##               mcse Rhat n_eff
## (Intercept)   0.00 1.00 2673
## x1            0.00 1.00 2684
## x2            0.00 1.00 2573
## mean_PPD      0.00 1.00 3240
## log-posterior 0.03 1.00 1700
##
## For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).

Model Info provides some high-level information about the type of model you have fit, algorithm used, size of the posterior sample (which equals chains * (iter - warmup)), etc.

Estimates provides statistics on the posterior samples for the parameters. Here we have information on,

• The mean and standard deviation of the marginal posterior distribution of each parameter.
• A set of quantiles. For example, 50% of the distribution for the (Intercept) parameter lies in the interval [-0.72,-0.37].
• The mean of the posterior predictive distribution (mean_PPD), which is the mean of the predictions made using the estimated parameters.
• The log-posterior is the logarithm of the likelihood times the prior up to some normalizing constant.

Diagnostics provides information about the sampler’s performance.

• Rhat is a measure of convergence among chains. It is the ratio of the average variance of the draws within each chain to the variance of the pooled draws across chains. If Rhat is 1 then the chains are in equilibirium (i.e. the chains have converged). You should be concerned if Rhat is greater than 1. Sometimes this can be rectified by increasing the number of iterations. Other times it might be an issue with the way your model is identified.
• n_eff is a rough measure of effective sample size.
• mcse is the “Monte Carlo Standard Error” a measure of inaccuracy of Monte Carlo samples.

Below is one of the more important diagnostic plots. It is a traceplot of each parameter. Notice how, for each parameter, the chains mix well. This is also reflected in the Rhat values equalling one as mentioned above.

library(bayesplot)
library(ggplot2)
color_scheme_set("mix-blue-pink")
plot(fit1, plotfun = "trace") + ggtitle("Traceplots")