This is a brief example of how to use the rstanarm package in the context of logistic regression. See the rstanarm documentation (reference manual and vigenettes) for the most up to date material. We will also use the bayesplot package that provides additional plotting functionality that users can apply to rstan/rstanarm objects, and the loo package that provides a way to perform model selection.
Although we will focus on the stan_glm
function to fit models it is useful to be aware of the other models that can be fit using rstanarm. Below we list the function and a very brief “story” about the data which identifies why you would use that particular function to model the data.
stan_lm
stan_glm
family=binomial(link = "logit")
).family="poission"
or family="neg_binomial_2
).stan_glmer
stan_polr
stan_betareg
Below we generate data to be used in logistic regression. The identifying feature of data that can be fit using logistic regression is that the outcome variable is binary.
# the inverse logit function
#' @param x A number fro the real line.
#' @return A probability.
inv_logit <- function(x) {
return(exp(x)/(1+exp(x)))
}
# (not necessary) function to generate multinormal random number generation for correlated predictors
#' @param n An integer value indicating the sample size
#' @param mu A vector of means of length K
#' @param sigma A K-by-K covariance matrix
#' @return A N-by-K matrix of multivariate normal random variables.
generate_multinorm_data <- function(n, mu, sigma) {
if(any((eigen(sigma)$values) < 0))
stop("\none or more eigenvalues of 'sigma' are negative")
x <- matrix(rnorm(n * length(mu)), ncol = length(mu), nrow = n)
sigma_decomp <- svd(sigma)
u <- sigma_decomp$u
d <- diag(sigma_decomp$d,length(sigma_decomp$d))
y <- t(mu + u %*% sqrt(d) %*% t(x))
return(y)
}
# function to generate logistic data
#' @param beta A vector of parameter values.
#' @param X A matrix of covariates.
#' @param const Indicate whether a constant should be included in the
#' data generation process (defaults to TRUE).
#' @return A vector of binary data.
generate_logistic_data <- function(beta, X, cons = TRUE) {
n <- nrow(X)
X_orig <- X
if (cons) {
X <- cbind(rep(1,n),X)
}
prob <- inv_logit(X%*%beta)
y <- rbinom(n, 1, prob)
out <- data.frame(cbind(y,X_orig,prob))
names <- paste0("x", 1:ncol(X_orig))
colnames(out) <- c("y", names, "prob")
return(out)
}
# generate data
N <- 300
prior_mu <- c(-0.5, 0.5, 1.5)
prior_sd <- rep(0.8,length(prior_mu))
beta <- c(rnorm(1, prior_mu[1], prior_sd[1]),
rnorm(1, prior_mu[2], prior_sd[2]),
rnorm(1, prior_mu[3], prior_sd[3]))
X <- generate_multinorm_data(N, beta[-1], matrix(c(1,0.5,0.5,1), ncol = 2))
dat_glm <- generate_logistic_data(beta, X, cons = TRUE)
# summarize dependent variable
table(dat_glm$y)
##
## 0 1
## 110 190
Our model is defined as follows,
\[ y \sim Bin(1, \mathbf{p}) \\ \mbox{logit}(\mathbf{p}) = \mathbf{X}\boldsymbol{\beta} \\ \beta_0 \sim \mathcal{N}(-0.5, 0.5) \\ \beta_1 \sim \mathcal{N}(0.5,0.25) \\ \beta_2 \sim \mathcal{N}(1.0,0.25) \]
Where \(\mbox{logit}(p) = \log\bigg(\frac{p}{1-p}\bigg)\). Note that the logit function \(\log\bigg(\frac{p}{1-p}\bigg)\) maps from the closed unit interval \([0,1]\) to the real line \(\mathbb{R}\). So we can use the inverse of the logit function to map our linear predictor \(\mathbf{X}\boldsymbol{\beta}\) to the closed unit interval and then characterize these values as probabilities which can be passed into the binomial distribution with size \(n=1\).
library(rstanarm)
# fit model using mcmc
fit1 <- stan_glm(y ~ x1 + x2, data = dat_glm, family = binomial(link = "logit"), cores = 4,
prior_intercept = normal(prior_mu[1], prior_sd[1]), prior = normal(prior_mu[-1], 0.5*prior_sd[-1]))
cbind("stan_glm" = coef(fit1), # model fit with mcmc
"beta" = beta, # the parameter realized from the prior distributions
"beta_mu" = prior_mu) # the prior distribution location parameters
## stan_glm beta beta_mu
## (Intercept) -0.6613041 -0.9392810 -0.5
## x1 -0.6930683 -0.8349270 0.5
## x2 0.7392915 0.8497792 1.5
We can take a look at the priors actually used in the model with the prior_summary()
function.
prior_summary(fit1)
## Priors for model 'fit1'
## ------
## Intercept (after predictors centered)
## ~ normal(location = -0.5, scale = 0.8)
##
## Coefficients
## ~ normal(location = [0.5,1.5], scale = [0.4,0.4])
## **adjusted scale = [0.38,0.38]
## ------
## See help('prior_summary.stanreg') for more details
Diagnostics refers to ensuring that the sampler is performing appropriately. First lets look at the summary output available from the stanreg object.
summary(fit1, digits = 2)
##
## Model Info:
##
## function: stan_glm
## family: binomial [logit]
## formula: y ~ x1 + x2
## algorithm: sampling
## priors: see help('prior_summary')
## sample: 4000 (posterior sample size)
## num obs: 300
##
## Estimates:
## mean sd 2.5% 25% 50% 75% 97.5%
## (Intercept) -0.67 0.24 -1.14 -0.82 -0.66 -0.51 -0.21
## x1 -0.69 0.14 -0.98 -0.79 -0.69 -0.60 -0.41
## x2 0.74 0.14 0.47 0.64 0.74 0.83 1.03
## mean_PPD 0.63 0.04 0.55 0.60 0.63 0.65 0.70
## log-posterior -192.55 1.26 -195.83 -193.10 -192.22 -191.64 -191.16
##
## Diagnostics:
## mcse Rhat n_eff
## (Intercept) 0.00 1.00 2673
## x1 0.00 1.00 2684
## x2 0.00 1.00 2573
## mean_PPD 0.00 1.00 3240
## log-posterior 0.03 1.00 1700
##
## For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
Model Info provides some high-level information about the type of model you have fit, algorithm used, size of the posterior sample (which equals chains
* (iter
- warmup
)), etc.
Estimates provides statistics on the posterior samples for the parameters. Here we have information on,
mean_PPD
), which is the mean of the predictions made using the estimated parameters.Diagnostics provides information about the sampler’s performance.
Below is one of the more important diagnostic plots. It is a traceplot of each parameter. Notice how, for each parameter, the chains mix well. This is also reflected in the Rhat values equalling one as mentioned above.
library(bayesplot)
library(ggplot2)
color_scheme_set("mix-blue-pink")
plot(fit1, plotfun = "trace") + ggtitle("Traceplots")