Motivating Example: Heart disease (Faraway Chapter 2)

data(wcgs, package = "faraway")
wcgs

As in the book, we at this moment only focus on 3 variables: height, cigs and chd

library(GGally)

## Loading required package: ggplot2

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

ggpairs(wcgs, columns = c("height", "cigs", "chd"), ggplot2::aes(colour=chd,alpha = 0.5))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We are interested in how cigarette usage and height predict heart disease

wcgs$y <- ifelse(wcgs$chd == "no", 0, 1)
plot(y~height, wcgs)

plot(jitter(y, 0.1) ~ jitter(height), wcgs, pch = ".")

If we fit a linear line, it can be outside of the \([0, 1]\) range.

Example: Dose-response study

The data reports the death of adult flour beetles after the exposure to gaseous carbon disulfide at various dosages. The data is in a group-level form.

beetles2 <- read.table("beetles2.dat", header = T)
beetles2

2.1 Group-level data V.S. ungrouped data

Let’s use the probit link.

alive <- beetles2$n - beetles2$dead
data <- matrix(append(beetles2$dead, alive), ncol = 2)
logdose <- beetles2$logdose
dead <- beetles2$dead
n <- beetles2$n
fit.probit <- glm(data ~ logdose, family = binomial(link = probit))
summary(fit.probit)

## 
## Call:
## glm(formula = data ~ logdose, family = binomial(link = probit))
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -34.956      2.649  -13.20   <2e-16 ***
## logdose       19.741      1.488   13.27   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 284.202  on 7  degrees of freedom
## Residual deviance:   9.987  on 6  degrees of freedom
## AIC: 40.185
## 
## Number of Fisher Scoring iterations: 4

Residual deviance is \(9.99\) (with p-value \(0.125\) from the likelihood ratio test, after comparing with the group-level saturated model)

Now let’s check the ungrouped data

Beetles <- read.table("Beetles.dat", header = T)
Beetles

fit.probit2 <- glm(y ~ x, family = binomial(link = probit), data = Beetles)
summary(fit.probit2)

## 
## Call:
## glm(formula = y ~ x, family = binomial(link = probit), data = Beetles)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -34.956      2.649  -13.20   <2e-16 ***
## x             19.741      1.488   13.27   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.44  on 480  degrees of freedom
## Residual deviance: 371.23  on 479  degrees of freedom
## AIC: 375.23
## 
## Number of Fisher Scoring iterations: 6

Residual deviance is \(371.23\). The log-likelihood ratio test here for the residual deviance is invalid.

The log-log link

This is the fit using the probit link. The data do not support the sysmmetric of the response curve at \(0.5\).

plot(logdose, dead/n, pch = 20, ylim = c(0, 1))
curve(predict(fit.probit, newdata = list(logdose = x), type = "response"), add = T, lty = 2)
abline(h = 0.5, col = "red", lty = 2)

The fit using the logit link

fit.logit <- glm(data ~ logdose, family = binomial(link = logit))
summary(fit.logit)

## 
## Call:
## glm(formula = data ~ logdose, family = binomial(link = logit))
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -60.740      5.182  -11.72   <2e-16 ***
## logdose       34.286      2.913   11.77   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 284.202  on 7  degrees of freedom
## Residual deviance:  11.116  on 6  degrees of freedom
## AIC: 41.314
## 
## Number of Fisher Scoring iterations: 4

plot(logdose, dead/n, pch = 20, ylim = c(0, 1))
curve(predict(fit.probit, newdata = list(logdose = x), type = "response"), add = T, lty = 2)
curve(predict(fit.logit, newdata = list(logdose = x), type = "response"), add = T, lty = 1)

The fitted curve is very similar, the residual deviance is slightly larger.

The fit using the complementary log-log link

fit.cloglog <- glm(data ~ logdose, family = binomial(link = cloglog))
summary(fit.cloglog)

## 
## Call:
## glm(formula = data ~ logdose, family = binomial(link = cloglog))
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -39.522      3.236  -12.21   <2e-16 ***
## logdose       22.015      1.797   12.25   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 284.2024  on 7  degrees of freedom
## Residual deviance:   3.5143  on 6  degrees of freedom
## AIC: 33.712
## 
## Number of Fisher Scoring iterations: 4

plot(logdose, dead/n, pch = 20, ylim = c(0, 1))
curve(predict(fit.probit, newdata = list(logdose = x), type = "response"), add = T, lty = 2)
curve(predict(fit.cloglog, newdata = list(logdose = x), type = "response"), add = T, lty = 1, col = "blue")

The fitted curve is better, and the residual deviance is smaller.

The fit using the log-log link. In R, we can not directly use a log-log link. We can fit a log-log link Binomial GLM using the cloglog link (see Agresti Chapter 5.6.3)

data2 <- matrix(append(alive, dead), ncol = 2)
fit.loglog <- glm(data2 ~ logdose, family = binomial(link = cloglog))
summary(fit.loglog)

## 
## Call:
## glm(formula = data2 ~ logdose, family = binomial(link = cloglog))
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   37.661      2.949   12.77   <2e-16 ***
## logdose      -21.583      1.680  -12.85   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 284.202  on 7  degrees of freedom
## Residual deviance:  27.573  on 6  degrees of freedom
## AIC: 57.771
## 
## Number of Fisher Scoring iterations: 6

plot(logdose, dead/n, pch = 20, ylim = c(0, 1))
curve(predict(fit.probit, newdata = list(logdose = x), type = "response"), add = T, lty = 2)
curve(1 - predict(fit.loglog, newdata = list(logdose = x), type = "response"), add = T, lty = 1, col = "red")

The fitted curve is much worse, and the residual deviance is larger.

Model visualization and diagnosis for binary data

Let us apply logistic regression on the chd data

lmod <- glm(chd ~ height + cigs, family = binomial, wcgs)
summary(lmod)

## 
## Call:
## glm(formula = chd ~ height + cigs, family = binomial, data = wcgs)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.50161    1.84186  -2.444   0.0145 *  
## height       0.02521    0.02633   0.957   0.3383    
## cigs         0.02313    0.00404   5.724 1.04e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1781.2  on 3153  degrees of freedom
## Residual deviance: 1749.0  on 3151  degrees of freedom
## AIC: 1755
## 
## Number of Fisher Scoring iterations: 5

Visualization of the predicted response

We can visualize the predicted model along one predictor by fixing any predictors

beta <- coef(lmod)
ilogit <- function(x) exp(x)/(1 + exp(x))

plot(jitter(y,0.1) ~ jitter(height), wcgs, xlab="Height", ylab="Heart Disease",pch=".")
curve(ilogit(beta[1] + beta[2]*x + beta[3]*0),add=TRUE)
curve(ilogit(beta[1] + beta[2]*x + beta[3]*20),add=TRUE,lty=2)

plot(jitter(y,0.1) ~ jitter(cigs), wcgs, xlab="Cigarette Use", ylab="Heart Disease",pch=".")
curve(ilogit(beta[1] + beta[2]*60 + beta[3]*x),add=TRUE)
curve(ilogit(beta[1] + beta[2]*78 + beta[3]*x),add=TRUE,lty=2)

Residual plots

Residual plots for binary response are typically not informative, as binary responses are “too discrete”.

plot(residuals(lmod, type = "deviance")~predict(lmod, type = "link"), main = "", xlab = expression(hat(eta)), ylab = "deviance residual")

To make the residual plot more informative, as suggested in the Faraway book (Chapter 2.4), one option is to bin the observations with adjacent predicted values

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

wcgs <- mutate(wcgs, residuals=residuals(lmod, type = "deviance"), linpred=predict(lmod)) ## attaching two new columns

## Create 100 bins
gdf <- group_by(wcgs, cut(linpred, breaks=unique(quantile(linpred, (1:100)/101))))

## Calculate average of residuals and linear predictors within bins
diagdf <- summarise(gdf, residuals=mean(residuals), linpred=mean(linpred))

plot(residuals ~ linpred, diagdf, xlab="linear predictor")

Here the binned deviance residuals are mostly negative. Because the deviance residuals come from binary data, we don’t have adequate theoretical intuitions of them. There is no guarantee that they should be centered even when the model is correct. In this case, we prefer pearson residuals, which are more interpretative

library(dplyr)
wcgs <- mutate(wcgs, residuals=residuals(lmod, type = "pearson"), linpred=predict(lmod)) ## attaching two new columns

## Create 100 bins
gdf <- group_by(wcgs, cut(linpred, breaks=unique(quantile(linpred, (1:100)/101))))

## Calculate average of residuals and linear predictors within bins
diagdf <- summarise(gdf, residuals=mean(residuals), linpred=mean(linpred))

plot(residuals ~ linpred, diagdf, xlab="linear predictor")

The residuals are centered, not too dispersed and do not have any particular pattern. So we don’t see any apparent problems with the fit.