How to fit a linear regression using Bayesian methods
Consider a Bayesian model fit as a remedial measure for influential cases.
I use the height-weight data from Kruschke (2015), which has three variables:
- male
- height
- weight
We will use height to predict the weight of a person.
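A minimal sketch of loading the data; the file name Ht-Wt.csv is the one referenced later in this post, and the exact path is an assumption:
library(DT)                   # for the interactive table below
dta <- read.csv("Ht-Wt.csv")  # illustrative path; columns: male, height, weight
str(dta)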
DT::datatable(dta, rownames = FALSE, filter = list(position = "top"))
fit <- lm(dta$weight ~ dta$height)
summary(fit)
Call:
lm(formula = dta$weight ~ dta$height)
Residuals:
Min 1Q Median 3Q Max
-63.95 -21.17 -5.26 16.24 201.94
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -104.7832 31.5056 -3.326 0.000992 ***
dta$height 3.9822 0.4737 8.406 1.77e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 31.59 on 298 degrees of freedom
Multiple R-squared: 0.1917, Adjusted R-squared: 0.189
F-statistic: 70.66 on 1 and 298 DF, p-value: 1.769e-15
Then we can see the fitted regression line overlaid on the data:
plot(dta$height, dta$weight, ylab = "Weight (lbs)", xlab = "Height (inches)",
main = "Scatter Plot between Height and Weight by Gender",
pch = as.numeric(dta$male), col = as.factor(dta$male))
abline(fit, col = "orange", lwd = 3)
There are four assumptions associated with a linear regression model: linearity, independence of the errors, homoscedasticity (constant error variance), and normality of the errors. I check the model fit, and look at outliers and leverage, via the standard diagnostic plots:
par(mfrow=c(2,2))
plot(fit)
Lastly, we look at the influential points
n <- dim(dta)[1]
cooksD <- cooks.distance(fit)
# identify influential points
(influential_obs <- as.numeric(names(cooksD)[(cooksD > (4/n))]))
[1] 2 117 134 140 163 164 169 172 212 233 260 263
#plot cooksD
plot(cooksD, pch="*", cex=2, main="Influential Obs by Cooks distance")
abline(h = 4/n, col="red") # add cutoff line
text(x=1:length(cooksD), y=cooksD, labels=ifelse((cooksD>(4/n)),names(cooksD),""), col="red", pos = 4)
So far we can say that the linear regression assumptions are violated in this case; for example, the errors do not follow a normal distribution. As a first remedy, how about we delete the influential points and re-fit the model:
dta.outliers_removed <- dta[-influential_obs, ]
fit.outliers_removed <- lm(dta.outliers_removed$weight ~ dta.outliers_removed$height)
summary(fit.outliers_removed)
Call:
lm(formula = dta.outliers_removed$weight ~ dta.outliers_removed$height)
Residuals:
Min 1Q Median 3Q Max
-55.16 -18.87 -3.18 16.55 83.27
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -155.9116 26.8300 -5.811 1.65e-08 ***
dta.outliers_removed$height 4.7112 0.4032 11.685 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 25.36 on 286 degrees of freedom
Multiple R-squared: 0.3231, Adjusted R-squared: 0.3208
F-statistic: 136.5 on 1 and 286 DF, p-value: < 2.2e-16
The model assumptions appear to be much better satisfied once the influential cases are removed.
par(mfrow=c(1,2))
plot(fit.outliers_removed, which=c(1,2))
The regression line changes noticeably when we remove the influential observations. That is dangerous!
par(mfrow=c(1,2))
plot(dta$height, dta$weight, ylab = "Weight (lbs)", xlab = "Height (inches)",
main = "With Outliers")
abline(fit, col = "orange", lwd = 3)
plot(dta.outliers_removed$height, dta.outliers_removed$weight, ylab = "Weight (lbs)", xlab = "Height (inches)",
main = "Outliers removed")
abline(fit.outliers_removed, col = "orange", lwd = 3)
However, deleting influential cases is often not the right solution, because it can produce biased estimates.
Through this case we have reviewed that influential points combine two ingredients: outlyingness and leverage. A point that makes a large difference to the fitted regression is called an influential point. Usually influential points have two characteristics: they are outliers in the response and they have high leverage in the predictor.
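As a quick illustration of these two ingredients, a sketch using base R diagnostics on the frequentist fit above:
lev <- hatvalues(fit)     # leverage: how unusual each height value is
res <- rstudent(fit)      # studentized residual: how unusual each weight is, given the fit
head(sort(lev, decreasing = TRUE))        # highest-leverage observations
head(sort(abs(res), decreasing = TRUE))   # largest outliers in the response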
Now consider the simple linear regression model
\[ y = \beta_0 + \beta_1 x + \epsilon \]
With the Bayesian approach, the distribution of \(\epsilon\) does not have to be Gaussian (normal); we are going to use a robust assumption, a Student-t likelihood with degrees-of-freedom parameter \(\nu\).
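As a quick visual of why the Student-t likelihood is more tolerant of extreme observations, compare its density with the Normal (df = 3 is just an illustrative value):
curve(dnorm(x), from = -6, to = 6, lty = 2, ylab = "density")
curve(dt(x, df = 3), add = TRUE)   # heavier tails than the Normal
legend("topright", legend = c("Normal", "t, df = 3"), lty = c(2, 1))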
Bayes' theorem for this model:
\[ p(\beta_0, \beta_1, \sigma, \nu \mid D) = \frac{p(D \mid \beta_0, \beta_1, \sigma, \nu) \ p(\beta_0, \beta_1, \sigma, \nu)}{\int \int \int \int p(D \mid \beta_0, \beta_1, \sigma, \nu) \ p(\beta_0, \beta_1, \sigma, \nu) \ d\beta_0 \ d\beta_1 \ d\sigma \ d\nu} \]
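Here \(D\) denotes the observed pairs \((x_i, y_i)\), and under the robust assumption the likelihood is a product of scaled Student-t densities:
\[ p(D \mid \beta_0, \beta_1, \sigma, \nu) = \prod_{i=1}^{N} \frac{1}{\sigma}\, t_{\nu}\!\left( \frac{y_i - (\beta_0 + \beta_1 x_i)}{\sigma} \right) \]
where \(t_{\nu}(\cdot)\) is the standard Student-t density with \(\nu\) degrees of freedom.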
Create the data list.
y <- dta$weight
x <- dta$height
dataList <- list(x = x, y = y)
Based on the Normal distribution (for demonstration purposes only; not run):
= "
modstring_norm # Specify the Normal model for none-standardized data:
model {
for (i in 1:Ntotal) {
y[i] ~ dnorm(mu[i], prec)
mu[i] = b[1] + b[2]*x[i]
}
for (i in 1:2) {
b[i] ~ dnorm(0.0, 1.0/1.0e6)
}
prec ~ dgamma(5/2.0, 5*10.0/2.0)
sig2 = 1.0 / prec
sig = sqrt(sig2)
} "
Based on the Student-t distribution (the robust assumption):
= "
modelString # Standardize the data:
data {
Ntotal <- length(y)
xm <- mean(x)
ym <- mean(y)
xsd <- sd(x)
ysd <- sd(y)
for ( i in 1:length(y) ) {
zx[i] <- (x[i] - xm) / xsd
zy[i] <- (y[i] - ym) / ysd
}
}
# Specify the model for standardized data:
model {
for ( i in 1:Ntotal ) {
zy[i] ~ dt( zbeta0 + zbeta1 * zx[i] , 1/zsigma^2 , nu )
}
# Priors vague on standardized scale:
zbeta0 ~ dnorm(0, 1/(10)^2 )
zbeta1 ~ dnorm(0, 1/(10)^2 )
zsigma ~ dunif(1.0E-3, 1.0E+3 )
nu ~ dexp(1/30.0)
# Transform to original scale:
beta1 <- zbeta1 * ysd / xsd
beta0 <- zbeta0 * ysd + ym - zbeta1 * xm * ysd / xsd
sigma <- zsigma * ysd
}
"
# Write out modelString to a text file
writeLines(modelString, con="TEMPmodel.txt")
In this tutorial we only execute the model specified with the Student-t distribution. Every arrow in the descriptive diagram has a corresponding line in the model specification. Variable names starting with "z" mean that these variables are standardized (z-scores).
Strong correlation between parameters creates a thin, elongated shape on their scatter plot, which makes Gibbs sampling very slow and inefficient; standardizing the data reduces this correlation. But remember to scale the results back to the original measures. HMC, as implemented in Stan, does not have this problem, although the same standardization can be applied in Stan in all situations.
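A minimal sketch of that back-transformation in R, using the standardized-scale posterior means reported further below (zbeta0 about -0.096, zbeta1 about 0.49) as illustrative inputs:
xm <- mean(dta$height); xsd <- sd(dta$height)
ym <- mean(dta$weight); ysd <- sd(dta$weight)
zbeta0 <- -0.096; zbeta1 <- 0.49                        # illustrative standardized-scale values
beta1 <- zbeta1 * ysd / xsd                             # slope on the original scale
beta0 <- zbeta0 * ysd + ym - zbeta1 * xm * ysd / xsd    # intercept on the original scale
c(beta0 = beta0, beta1 = beta1)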
= c("beta0" , "beta1" , "sigma",
parameters "zbeta0" , "zbeta1" , "zsigma", "nu")
= 500 # Number of steps to "tune" the samplers
adaptSteps = 1000
burnInSteps = 4
nChains = 1
thinSteps =20000
numSavedSteps= ceiling((numSavedSteps*thinSteps ) / nChains )
nIter = jags.model("TEMPmodel.txt", data=dataList ,
jagsModel n.chains=nChains, n.adapt=adaptSteps)
update(jagsModel, n.iter=burnInSteps)
= coda.samples(jagsModel, variable.names=parameters,
codaSamples n.iter=nIter, thin=thinSteps)
summary(codaSamples)
Iterations = 1501:6500
Thinning interval = 1
Number of chains = 4
Sample size per chain = 5000
1. Empirical mean and standard deviation for each variable,
plus standard error of the mean:
Mean SD Naive SE Time-series SE
beta0 -139.96225 27.61760 0.1952859 0.2180717
beta1 4.46120 0.41330 0.0029225 0.0032290
nu 5.37901 1.62796 0.0115114 0.0236101
sigma 23.97943 1.65132 0.0116766 0.0204363
zbeta0 -0.09632 0.04806 0.0003398 0.0004283
zbeta1 0.49046 0.04544 0.0003213 0.0003550
zsigma 0.68372 0.04708 0.0003329 0.0005827
2. Quantiles for each variable:
2.5% 25% 50% 75% 97.5%
beta0 -193.3580 -158.7464 -140.16434 -121.3445 -8.624e+01
beta1 3.6527 4.1830 4.46491 4.7407 5.257e+00
nu 3.1509 4.2578 5.05987 6.1485 9.410e+00
sigma 20.8323 22.8476 23.94254 25.0519 2.728e+01
zbeta0 -0.1888 -0.1282 -0.09682 -0.0644 -4.115e-04
zbeta1 0.4016 0.4599 0.49087 0.5212 5.779e-01
zsigma 0.5940 0.6514 0.68267 0.7143 7.779e-01
plot(codaSamples, trace=TRUE, density=FALSE) # note: many graphs
autocorr.plot(codaSamples, ask=F)
effectiveSize(codaSamples)
beta0 beta1 nu sigma zbeta0 zbeta1 zsigma
16054.484 16401.066 4768.900 6535.965 12892.980 16401.066 6535.965
#gelman.diag(codaSamples)
gelman.plot(codaSamples) # lines may return error ==> Most likely reason: collinearity of parameters
# note: the function indexes codaSamples[[1]] rather than z, so every list element below is identical
(HDIofChains <- lapply(codaSamples, function(z)
  cbind(Mu = hdi(codaSamples[[1]][,1]), Sd = hdi(codaSamples[[1]][,2]))))
[[1]]
var1 var1
lower -192.46444 3.660595
upper -85.49078 5.257061
[[2]]
var1 var1
lower -192.46444 3.660595
upper -85.49078 5.257061
[[3]]
var1 var1
lower -192.46444 3.660595
upper -85.49078 5.257061
[[4]]
var1 var1
lower -192.46444 3.660595
upper -85.49078 5.257061
Look at the strong correlation between beta0 and beta1, which slows Gibbs sampling down.
head(as.matrix(codaSamples[[1]]))
beta0 beta1 nu sigma zbeta0 zbeta1
[1,] -168.7337 4.890447 5.623225 23.90661 -0.10407253 0.5376553
[2,] -158.6082 4.744414 6.371647 24.56065 -0.09181883 0.5216004
[3,] -166.9662 4.873701 6.346842 22.92533 -0.08537616 0.5358143
[4,] -161.6680 4.795882 3.209357 23.00506 -0.08162923 0.5272588
[5,] -160.5706 4.801212 3.153090 22.15810 -0.04024801 0.5278448
[6,] -155.9331 4.684888 4.394775 21.44653 -0.12823063 0.5150561
zsigma
[1,] 0.6816415
[2,] 0.7002899
[3,] 0.6536626
[4,] 0.6559359
[5,] 0.6317868
[6,] 0.6114980
pairs(as.matrix(codaSamples[[1]])[,1:4])
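To quantify the correlation visible in the pairs plot, a quick check on the first chain (a sketch; the exact value varies by run):
cor(as.matrix(codaSamples[[1]])[, c("beta0", "beta1")])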
In order to give vague priors to the slope and intercept, consider the following argument. The largest absolute value the slope can take is
\[ \frac{\sigma_y}{\sigma_x} \]
which is attained when the variables \(x\) and \(y\) are perfectly correlated. The prior standard deviation of the slope parameter \(\beta_1\) should therefore be large enough to make this maximum value easily achievable. Similarly, the magnitude of the intercept can then be roughly as large as
\[ E[x] \frac{\sigma_y}{\sigma_x} \]
so the prior on \(\beta_0\) should have enough width to include this value.
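A minimal sketch of how wide those priors end up for this data set; this follows the verbal argument above, while the Stan transformed data block below computes its own versions:
sdX <- sd(dta$height); sdY <- sd(dta$weight); meanX <- mean(dta$height)
(beta1sigma <- 10 * abs(sdY / sdX))           # generous prior sd for the slope
(beta0sigma <- 10 * abs(meanX * sdY / sdX))   # generous prior sd for the intercept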
= "
modelString data {
int<lower=1> Ntotal;
real x[Ntotal];
real y[Ntotal];
real meanY;
real sdY;
real meanX;
real sdX;
}
transformed data {
real unifLo;
real unifHi;
real expLambda;
real beta0sigma;
real beta1sigma;
unifLo = sdY/1000;
unifHi = sdY*1000;
expLambda = 1/30.0;
beta1sigma = 10*fabs(sdY/sdX);
beta0sigma = 10*(sdY^2+sdX^2) / 10*fabs(meanX*sdY/sdX);
}
parameters {
real beta0;
real beta1;
real<lower=0> nu;
real<lower=0> sigma;
}
model {
sigma ~ uniform(unifLo, unifHi);
nu ~ exponential(expLambda);
beta0 ~ normal(0, beta0sigma);
beta1 ~ normal(0, beta1sigma);
for (i in 1:Ntotal) {
y[i] ~ student_t(nu, beta0 + beta1 * x[i], sigma);
}
}
"
stanDsoRobustReg <- stan_model(model_code=modelString)
dat <- list(Ntotal=length(dta$weight),
            y=dta$weight,
            meanY=mean(dta$weight),
            sdY=sd(dta$weight),
            x=dta$height,
            meanX=mean(dta$height),
            sdX=sd(dta$height))
fitSimRegStan <- sampling(stanDsoRobustReg,
                          data=dat,
                          pars=c('beta0', 'beta1', 'nu', 'sigma'),
                          iter=5000, chains = 4, cores = 4)
Save the fitted object.
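For example (the file name is just an illustration):
saveRDS(fitSimRegStan, file = "fitSimRegStan.rds")
# fitSimRegStan <- readRDS("fitSimRegStan.rds")   # reload in a later session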
print(fitSimRegStan)
Inference for Stan model: 99bd7b261ff9240bf0c6d7b1b21831d6.
4 chains, each with iter=5000; warmup=2500; thin=1;
post-warmup draws per chain=2500, total post-warmup draws=10000.
mean se_mean sd 2.5% 25% 50% 75%
beta0 -139.35 0.50 27.89 -194.81 -157.61 -139.51 -120.74
beta1 4.45 0.01 0.42 3.64 4.17 4.45 4.72
nu 5.37 0.03 1.61 3.15 4.25 5.07 6.16
sigma 23.98 0.03 1.66 20.81 22.84 23.96 25.07
lp__ -1265.00 0.02 1.43 -1268.55 -1265.74 -1264.68 -1263.94
97.5% n_eff Rhat
beta0 -85.12 3072 1
beta1 5.28 3117 1
nu 9.39 3420 1
sigma 27.31 3579 1
lp__ -1263.20 3332 1
Samples were drawn using NUTS(diag_e) at Thu Apr 22 15:10:38 2021.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
plot(fitSimRegStan)
rstan::traceplot(fitSimRegStan, ncol=1, inc_warmup=F)
pairs(fitSimRegStan, pars=c('nu','beta0','beta1','sigma'))
stan_scat(fitSimRegStan, c('beta0','beta1'))
stan_scat(fitSimRegStan, c('beta1','sigma'))
stan_scat(fitSimRegStan, c('beta0','sigma'))
stan_scat(fitSimRegStan, c('nu','sigma'))
stan_dens(fitSimRegStan)
stan_ac(fitSimRegStan, separate_chains = T)
stan_diag(fitSimRegStan,information = "sample",chain=0)
stan_diag(fitSimRegStan,information = "stepsize",chain = 0)
stan_diag(fitSimRegStan,information = "treedepth",chain = 0)
stan_diag(fitSimRegStan,information = "divergence",chain = 0)
Work with the shinystan object:
launch_shinystan(fitSimRegStan)
Recall that the data in this example contains the predictor height and the output weight for a group of people from Ht-Wt.csv (the data shown above).
Plot all heights observed in the sample and check the summary of the variable.
plot(1:length(dat$x),dat$x)
summary(dat$x)
Min. 1st Qu. Median Mean 3rd Qu. Max.
54.60 64.00 66.20 66.39 69.20 76.00
Can we predict the weight of a person who is 50 or 80 inches tall? Note that both values lie outside the observed range of heights (54.6 to 76 inches), so this is extrapolation. To do this we can go through all pairs of simulated parameters (\(\beta_0\), \(\beta_1\)) and use them to compute \(y(50)\) and \(y(80)\). This gives a distribution of predicted values.
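As a quick sanity check with the posterior means from the Stan fit (see the summary below), the point predictions are:
-139.35 + 4.45 * c(50, 80)   # approximately 83.2 and 216.7 lbs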
summary(fitSimRegStan)
$summary
mean se_mean sd 2.5% 25%
beta0 -139.354196 0.503202352 27.8915373 -194.808900 -157.605829
beta1 4.451593 0.007474065 0.4173043 3.636611 4.172879
nu 5.366643 0.027556563 1.6115181 3.151177 4.246276
sigma 23.982791 0.027670259 1.6554405 20.814300 22.838675
lp__ -1265.002519 0.024734258 1.4276951 -1268.552672 -1265.738311
50% 75% 97.5% n_eff Rhat
beta0 -139.512830 -120.737878 -85.121838 3072.271 1.000770
beta1 4.454489 4.723609 5.282330 3117.396 1.000699
nu 5.069790 6.157253 9.391004 3419.954 1.000994
sigma 23.957637 25.072876 27.313013 3579.322 1.000550
lp__ -1264.682923 -1263.938424 -1263.195323 3331.756 1.000423
$c_summary
, , chains = chain:1
stats
parameter mean sd 2.5% 25%
beta0 -140.496987 27.9128445 -195.452362 -159.028578
beta1 4.468322 0.4176474 3.643375 4.180768
nu 5.333266 1.6129391 3.146164 4.239934
sigma 23.996371 1.6756857 20.745408 22.833622
lp__ -1264.973734 1.3931068 -1268.262683 -1265.705478
stats
parameter 50% 75% 97.5%
beta0 -140.690732 -121.306242 -85.591653
beta1 4.473340 4.745248 5.301112
nu 5.036853 6.074138 9.402810
sigma 23.966000 25.101448 27.242245
lp__ -1264.658401 -1263.936454 -1263.201543
, , chains = chain:2
stats
parameter mean sd 2.5% 25%
beta0 -140.119919 27.7217798 -195.188356 -158.408294
beta1 4.462118 0.4146886 3.668582 4.177839
nu 5.396927 1.5988486 3.155521 4.277927
sigma 23.990640 1.6168350 21.056831 22.846187
lp__ -1264.946317 1.4072435 -1268.606024 -1265.657309
stats
parameter 50% 75% 97.5%
beta0 -140.000666 -121.163812 -86.601618
beta1 4.462406 4.731267 5.288368
nu 5.129432 6.181671 9.348566
sigma 23.918413 25.056950 27.345459
lp__ -1264.644540 -1263.891813 -1263.183147
, , chains = chain:3
stats
parameter mean sd 2.5% 25%
beta0 -137.758043 28.0473599 -193.976570 -156.709899
beta1 4.428479 0.4196264 3.620966 4.142890
nu 5.273822 1.5474425 3.130205 4.160287
sigma 23.905531 1.6471567 20.672028 22.808588
lp__ -1265.023664 1.3965268 -1268.555138 -1265.759590
stats
parameter 50% 75% 97.5%
beta0 -137.963462 -118.57774 -83.786093
beta1 4.430414 4.70807 5.278970
nu 4.999793 6.07770 9.087928
sigma 23.906472 24.98220 27.219559
lp__ -1264.712841 -1263.99760 -1263.196266
, , chains = chain:4
stats
parameter mean sd 2.5% 25%
beta0 -139.041834 27.8184750 -193.491107 -156.858089
beta1 4.447452 0.4163603 3.625069 4.186441
nu 5.462557 1.6789159 3.209953 4.299050
sigma 24.038623 1.6794892 20.805817 22.868034
lp__ -1265.066361 1.5085815 -1268.785015 -1265.858500
stats
parameter 50% 75% 97.5%
beta0 -139.288564 -121.639438 -83.861266
beta1 4.449144 4.707789 5.268918
nu 5.106526 6.256394 9.576797
sigma 24.014382 25.132032 27.449061
lp__ -1264.738186 -1263.933356 -1263.202000
regParam <- cbind(Beta0=rstan::extract(fitSimRegStan,pars="beta0")$'beta0',
                  Beta1=rstan::extract(fitSimRegStan,pars="beta1")$'beta1')
head(regParam)
Beta0 Beta1
[1,] -174.7224 4.962250
[2,] -118.6827 4.113129
[3,] -152.0820 4.701561
[4,] -147.0165 4.564773
[5,] -159.9549 4.751269
[6,] -126.6841 4.251516
predX50 <- apply(regParam, 1, function(z) z %*% c(1, 50))
predX80 <- apply(regParam, 1, function(z) z %*% c(1, 80))
Plot both distributions, look at their summaries and HDIs.
suppressWarnings(library(HDInterval))
den <- density(predX50)
plot(density(predX80), xlim=c(60,240))
lines(den$x, den$y)
summary(cbind(predX50,predX80))
predX50 predX80
Min. : 55.42 Min. :195.2
1st Qu.: 78.48 1st Qu.:212.9
Median : 83.19 Median :216.8
Mean : 83.23 Mean :216.8
3rd Qu.: 87.97 3rd Qu.:220.6
Max. :111.77 Max. :240.3
rbind(predX50=hdi(predX50),predX80=hdi(predX80))
lower upper
predX50 69.09173 97.20656
predX80 205.32725 228.15029
Both JAGS and Stan produced practically identical results.
plot(dta$height, dta$weight, ylab = "Weight (lbs)", xlab = "Height (inches)",
     main = "With Outliers")
abline(fit, col = "orange", lwd = 2)                          # OLS with influential points
abline(a = -139.96225, b = 4.46120, col = "blue", lwd = 1)    # Bayesian posterior means (JAGS)
abline(fit.outliers_removed, col = "red", lty = "dashed", lwd = 1)  # OLS, influential points removed
|           | With influential | Without influential | Bayesian approach w/ robust assumption |
|-----------|------------------|---------------------|----------------------------------------|
| Line      | orange, solid    | red, dashed         | blue, solid                            |
| Intercept | -104.78          | -155.91             | -139.96                                |
| Slope     | 3.98             | 4.71                | 4.46                                   |
Great! From the comparison we can see that fitting a simple linear regression with Bayesian methods and a robust (Student-t) likelihood remains stable when the traditional regression suffers from influential points, without having to delete any observations.