How to fit a linear regression using Bayesian methods
Consider a Bayesian model fit as a remedial measure for influential cases.
I use the height-weight data from Kruschke (2015), which has three variables:
- male
- height
- weight
We will use height to predict the weight of a person.
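A minimal sketch of loading the data; the file name Ht-Wt.csv is the one referenced later in this post, and the exact path is an assumption:
library(DT)                   # for the interactive table below
dta <- read.csv("Ht-Wt.csv")  # illustrative path; columns: male, height, weight
str(dta)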
DT::datatable(dta, rownames = FALSE, filter = list(position = "top"))
fit <- lm(dta$weight ~ dta$height)
summary(fit)
Call:
lm(formula = dta$weight ~ dta$height)
Residuals:
Min 1Q Median 3Q Max
-63.95 -21.17 -5.26 16.24 201.94
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -104.7832 31.5056 -3.326 0.000992 ***
dta$height 3.9822 0.4737 8.406 1.77e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 31.59 on 298 degrees of freedom
Multiple R-squared: 0.1917, Adjusted R-squared: 0.189
F-statistic: 70.66 on 1 and 298 DF, p-value: 1.769e-15
Then we can see the fitted regression line overlaid on the data:
plot(dta$height, dta$weight, ylab = "Weight (lbs)", xlab = "Height (inches)",
main = "Scatter Plot between Height and Weight by Gender",
pch = as.numeric(dta$male), col = as.factor(dta$male))
abline(fit, col = "orange", lwd = 3)
There are four assumptions associated with a linear regression model: linearity, independence of the errors, homoscedasticity (constant error variance), and normality of the errors. I check the model fit, and look at outliers and leverage, via the standard diagnostic plots:
par(mfrow=c(2,2))
plot(fit)
Lastly, we look at the influential points
n <- dim(dta)[1]
cooksD <- cooks.distance(fit)
# identify influential points
(influential_obs <- as.numeric(names(cooksD)[(cooksD > (4/n))]))
[1] 2 117 134 140 163 164 169 172 212 233 260 263
#plot cooksD
plot(cooksD, pch="*", cex=2, main="Influential Obs by Cooks distance")
abline(h = 4/n, col="red") # add cutoff line
text(x=1:length(cooksD), y=cooksD, labels=ifelse((cooksD>(4/n)),names(cooksD),""), col="red", pos = 4)
So far we can say that the linear regression assumptions are violated in this case; for example, the errors do not follow a normal distribution. As a first remedy, how about we delete the influential points and re-fit the model:
dta.outliers_removed <- dta[-influential_obs, ]
fit.outliers_removed <- lm(dta.outliers_removed$weight ~ dta.outliers_removed$height)
summary(fit.outliers_removed)
Call:
lm(formula = dta.outliers_removed$weight ~ dta.outliers_removed$height)
Residuals:
Min 1Q Median 3Q Max
-55.16 -18.87 -3.18 16.55 83.27
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -155.9116 26.8300 -5.811 1.65e-08 ***
dta.outliers_removed$height 4.7112 0.4032 11.685 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 25.36 on 286 degrees of freedom
Multiple R-squared: 0.3231, Adjusted R-squared: 0.3208
F-statistic: 136.5 on 1 and 286 DF, p-value: < 2.2e-16
The model assumptions appear to be much better satisfied once the influential cases are removed.
par(mfrow=c(1,2))
plot(fit.outliers_removed, which=c(1,2))
The regression line changes noticeably when we remove the influential observations. That is dangerous!
par(mfrow=c(1,2))
plot(dta$height, dta$weight, ylab = "Weight (lbs)", xlab = "Height (inches)",
main = "With Outliers")
abline(fit, col = "orange", lwd = 3)
plot(dta.outliers_removed$height, dta.outliers_removed$weight, ylab = "Weight (lbs)", xlab = "Height (inches)",
main = "Outliers removed")
abline(fit.outliers_removed, col = "orange", lwd = 3)
However, deleting influential cases is often not the right solution, because it can produce biased estimates.
Through this case we have reviewed that influential points combine two ingredients: outlyingness and leverage. A point that makes a large difference to the fitted regression is called an influential point. Usually influential points have two characteristics: they are outliers in the response and they have high leverage in the predictor.
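As a quick illustration of these two ingredients, a sketch using base R diagnostics on the frequentist fit above:
lev <- hatvalues(fit)     # leverage: how unusual each height value is
res <- rstudent(fit)      # studentized residual: how unusual each weight is, given the fit
head(sort(lev, decreasing = TRUE))        # highest-leverage observations
head(sort(abs(res), decreasing = TRUE))   # largest outliers in the response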
Now consider the simple linear regression model
\[ y = \beta_0 + \beta_1 x + \epsilon \]
With the Bayesian approach, the distribution of \(\epsilon\) does not have to be Gaussian (normal); we are going to use a robust assumption, a Student-t likelihood with degrees-of-freedom parameter \(\nu\).
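As a quick visual of why the Student-t likelihood is more tolerant of extreme observations, compare its density with the Normal (df = 3 is just an illustrative value):
curve(dnorm(x), from = -6, to = 6, lty = 2, ylab = "density")
curve(dt(x, df = 3), add = TRUE)   # heavier tails than the Normal
legend("topright", legend = c("Normal", "t, df = 3"), lty = c(2, 1))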
Bayes' theorem for this model:
\[ p(\beta_0, \beta_1, \sigma, \nu \mid D) = \frac{p(D \mid \beta_0, \beta_1, \sigma, \nu) \ p(\beta_0, \beta_1, \sigma, \nu)}{\int \int \int \int p(D \mid \beta_0, \beta_1, \sigma, \nu) \ p(\beta_0, \beta_1, \sigma, \nu) \ d\beta_0 \ d\beta_1 \ d\sigma \ d\nu} \]
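Here \(D\) denotes the observed pairs \((x_i, y_i)\), and under the robust assumption the likelihood is a product of scaled Student-t densities:
\[ p(D \mid \beta_0, \beta_1, \sigma, \nu) = \prod_{i=1}^{N} \frac{1}{\sigma}\, t_{\nu}\!\left( \frac{y_i - (\beta_0 + \beta_1 x_i)}{\sigma} \right) \]
where \(t_{\nu}(\cdot)\) is the standard Student-t density with \(\nu\) degrees of freedom.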
Create the data list.
y <- dta$weight
x <- dta$height
dataList <- list(x = x, y = y)
Based on the Normal distribution (for demonstration purposes only; not run):
= "
modstring_norm # Specify the Normal model for none-standardized data:
model {
for (i in 1:Ntotal) {
y[i] ~ dnorm(mu[i], prec)
mu[i] = b[1] + b[2]*x[i]
}
for (i in 1:2) {
b[i] ~ dnorm(0.0, 1.0/1.0e6)
}
prec ~ dgamma(5/2.0, 5*10.0/2.0)
sig2 = 1.0 / prec
sig = sqrt(sig2)
} "
Based on the Student-t distribution (the robust assumption):
= "
modelString # Standardize the data:
data {
Ntotal <- length(y)
xm <- mean(x)
ym <- mean(y)
xsd <- sd(x)
ysd <- sd(y)
for ( i in 1:length(y) ) {
zx[i] <- (x[i] - xm) / xsd
zy[i] <- (y[i] - ym) / ysd
}
}
# Specify the model for standardized data:
model {
for ( i in 1:Ntotal ) {
zy[i] ~ dt( zbeta0 + zbeta1 * zx[i] , 1/zsigma^2 , nu )
}
# Priors vague on standardized scale:
zbeta0 ~ dnorm(0, 1/(10)^2 )
zbeta1 ~ dnorm(0, 1/(10)^2 )
zsigma ~ dunif(1.0E-3, 1.0E+3 )
nu ~ dexp(1/30.0)
# Transform to original scale:
beta1 <- zbeta1 * ysd / xsd
beta0 <- zbeta0 * ysd + ym - zbeta1 * xm * ysd / xsd
sigma <- zsigma * ysd
}
"
# Write out modelString to a text file
writeLines(modelString, con="TEMPmodel.txt")
In this tutorial we only execute the model specified with the Student-t distribution. Every arrow in the descriptive diagram has a corresponding line in the model specification. Variable names starting with "z" mean that these variables are standardized (z-scores).
Strong correlation between parameters creates a thin, elongated shape on their scatter plot, which makes Gibbs sampling very slow and inefficient; standardizing the data reduces this correlation. But remember to scale the results back to the original measures. HMC, as implemented in Stan, does not have this problem, although the same standardization can be applied in Stan in all situations.
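A minimal sketch of that back-transformation in R, using the standardized-scale posterior means reported further below (zbeta0 about -0.096, zbeta1 about 0.49) as illustrative inputs:
xm <- mean(dta$height); xsd <- sd(dta$height)
ym <- mean(dta$weight); ysd <- sd(dta$weight)
zbeta0 <- -0.096; zbeta1 <- 0.49                        # illustrative standardized-scale values
beta1 <- zbeta1 * ysd / xsd                             # slope on the original scale
beta0 <- zbeta0 * ysd + ym - zbeta1 * xm * ysd / xsd    # intercept on the original scale
c(beta0 = beta0, beta1 = beta1)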
= c("beta0" , "beta1" , "sigma",
parameters "zbeta0" , "zbeta1" , "zsigma", "nu")
= 500 # Number of steps to "tune" the samplers
adaptSteps = 1000
burnInSteps = 4
nChains = 1
thinSteps =20000
numSavedSteps= ceiling((numSavedSteps*thinSteps ) / nChains )
nIter = jags.model("TEMPmodel.txt", data=dataList ,
jagsModel n.chains=nChains, n.adapt=adaptSteps)
update(jagsModel, n.iter=burnInSteps)
= coda.samples(jagsModel, variable.names=parameters,
codaSamples n.iter=nIter, thin=thinSteps)
summary(codaSamples)
Iterations = 1501:6500
Thinning interval = 1
Number of chains = 4
Sample size per chain = 5000
1. Empirical mean and standard deviation for each variable,
plus standard error of the mean:
Mean SD Naive SE Time-series SE
beta0 -139.96225 27.61760 0.1952859 0.2180717
beta1 4.46120 0.41330 0.0029225 0.0032290
nu 5.37901 1.62796 0.0115114 0.0236101
sigma 23.97943 1.65132 0.0116766 0.0204363
zbeta0 -0.09632 0.04806 0.0003398 0.0004283
zbeta1 0.49046 0.04544 0.0003213 0.0003550
zsigma 0.68372 0.04708 0.0003329 0.0005827
2. Quantiles for each variable:
2.5% 25% 50% 75% 97.5%
beta0 -193.3580 -158.7464 -140.16434 -121.3445 -8.624e+01
beta1 3.6527 4.1830 4.46491 4.7407 5.257e+00
nu 3.1509 4.2578 5.05987 6.1485 9.410e+00
sigma 20.8323 22.8476 23.94254 25.0519 2.728e+01
zbeta0 -0.1888 -0.1282 -0.09682 -0.0644 -4.115e-04
zbeta1 0.4016 0.4599 0.49087 0.5212 5.779e-01
zsigma 0.5940 0.6514 0.68267 0.7143 7.779e-01
plot(codaSamples, trace=TRUE, density=FALSE) # note: many graphs
autocorr.plot(codaSamples, ask=F)
effectiveSize(codaSamples)
beta0 beta1 nu sigma zbeta0 zbeta1 zsigma
16054.484 16401.066 4768.900 6535.965 12892.980 16401.066 6535.965
#gelman.diag(codaSamples)
gelman.plot(codaSamples) # lines may return error ==> Most likely reason: collinearity of parameters
# note: the function indexes codaSamples[[1]] rather than z, so every list element below is identical
(HDIofChains <- lapply(codaSamples, function(z)
  cbind(Mu = hdi(codaSamples[[1]][,1]), Sd = hdi(codaSamples[[1]][,2]))))
[[1]]
var1 var1
lower -192.46444 3.660595
upper -85.49078 5.257061
[[2]]
var1 var1
lower -192.46444 3.660595
upper -85.49078 5.257061
[[3]]
var1 var1
lower -192.46444 3.660595
upper -85.49078 5.257061
[[4]]
var1 var1
lower -192.46444 3.660595
upper -85.49078 5.257061
Look at the strong correlation between beta0 and beta1, which slows Gibbs sampling down.
head(as.matrix(codaSamples[[1]]))
beta0 beta1 nu sigma zbeta0 zbeta1
[1,] -168.7337 4.890447 5.623225 23.90661 -0.10407253 0.5376553
[2,] -158.6082 4.744414 6.371647 24.56065 -0.09181883 0.5216004
[3,] -166.9662 4.873701 6.346842 22.92533 -0.08537616 0.5358143
[4,] -161.6680 4.795882 3.209357 23.00506 -0.08162923 0.5272588
[5,] -160.5706 4.801212 3.153090 22.15810 -0.04024801 0.5278448
[6,] -155.9331 4.684888 4.394775 21.44653 -0.12823063 0.5150561
zsigma
[1,] 0.6816415
[2,] 0.7002899
[3,] 0.6536626
[4,] 0.6559359
[5,] 0.6317868
[6,] 0.6114980
pairs(as.matrix(codaSamples[[1]])[,1:4])
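To quantify the correlation visible in the pairs plot, a quick check on the first chain (a sketch; the exact value varies by run):
cor(as.matrix(codaSamples[[1]])[, c("beta0", "beta1")])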
In order to give vague priors to the slope and intercept, consider the following argument. The largest absolute value the slope can take is
\[ \frac{\sigma_y}{\sigma_x} \]
which is attained when the variables \(x\) and \(y\) are perfectly correlated. The prior standard deviation of the slope parameter \(\beta_1\) should therefore be large enough to make this maximum value easily achievable. Similarly, the magnitude of the intercept can then be roughly as large as
\[ E[x] \frac{\sigma_y}{\sigma_x} \]
so the prior on \(\beta_0\) should have enough width to include this value.
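A minimal sketch of how wide those priors end up for this data set; this follows the verbal argument above, while the Stan transformed data block below computes its own versions:
sdX <- sd(dta$height); sdY <- sd(dta$weight); meanX <- mean(dta$height)
(beta1sigma <- 10 * abs(sdY / sdX))           # generous prior sd for the slope
(beta0sigma <- 10 * abs(meanX * sdY / sdX))   # generous prior sd for the intercept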
= "
modelString data {
int<lower=1> Ntotal;
real x[Ntotal];
real y[Ntotal];
real meanY;
real sdY;
real meanX;
real sdX;
}
transformed data {
real unifLo;
real unifHi;
real expLambda;
real beta0sigma;
real beta1sigma;
unifLo = sdY/1000;
unifHi = sdY*1000;
expLambda = 1/30.0;
beta1sigma = 10*fabs(sdY/sdX);
beta0sigma = 10*(sdY^2+sdX^2) / 10*fabs(meanX*sdY/sdX);
}
parameters {
real beta0;
real beta1;
real<lower=0> nu;
real<lower=0> sigma;
}
model {
sigma ~ uniform(unifLo, unifHi);
nu ~ exponential(expLambda);
beta0 ~ normal(0, beta0sigma);
beta1 ~ normal(0, beta1sigma);
for (i in 1:Ntotal) {
y[i] ~ student_t(nu, beta0 + beta1 * x[i], sigma);
}
}
"
stanDsoRobustReg <- stan_model(model_code=modelString)
dat <- list(Ntotal=length(dta$weight),
            y=dta$weight,
            meanY=mean(dta$weight),
            sdY=sd(dta$weight),
            x=dta$height,
            meanX=mean(dta$height),
            sdX=sd(dta$height))
fitSimRegStan <- sampling(stanDsoRobustReg,
                          data=dat,
                          pars=c('beta0', 'beta1', 'nu', 'sigma'),
                          iter=5000, chains = 4, cores = 4)
Save the fitted object.
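For example (the file name is just an illustration):
saveRDS(fitSimRegStan, file = "fitSimRegStan.rds")
# fitSimRegStan <- readRDS("fitSimRegStan.rds")   # reload in a later session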
print(fitSimRegStan)
Inference for Stan model: 99bd7b261ff9240bf0c6d7b1b21831d6.
4 chains, each with iter=5000; warmup=2500; thin=1;
post-warmup draws per chain=2500, total post-warmup draws=10000.
mean se_mean sd 2.5% 25% 50% 75%
beta0 -139.35 0.50 27.89 -194.81 -157.61 -139.51 -120.74
beta1 4.45 0.01 0.42 3.64 4.17 4.45 4.72
nu 5.37 0.03 1.61 3.15 4.25 5.07 6.16
sigma 23.98 0.03 1.66 20.81 22.84 23.96 25.07
lp__ -1265.00 0.02 1.43 -1268.55 -1265.74 -1264.68 -1263.94
97.5% n_eff Rhat
beta0 -85.12 3072 1
beta1 5.28 3117 1
nu 9.39 3420 1
sigma 27.31 3579 1
lp__ -1263.20 3332 1
Samples were drawn using NUTS(diag_e) at Thu Apr 22 15:10:38 2021.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
plot(fitSimRegStan)
rstan::traceplot(fitSimRegStan, ncol=1, inc_warmup=F)
pairs(fitSimRegStan, pars=c('nu','beta0','beta1','sigma'))
stan_scat(fitSimRegStan, c('beta0','beta1'))
stan_scat(fitSimRegStan, c('beta1','sigma'))
stan_scat(fitSimRegStan, c('beta0','sigma'))
stan_scat(fitSimRegStan, c('nu','sigma'))
stan_dens(fitSimRegStan)
stan_ac(fitSimRegStan, separate_chains = T)
stan_diag(fitSimRegStan,information = "sample",chain=0)
stan_diag(fitSimRegStan,information = "stepsize",chain = 0)
stan_diag(fitSimRegStan,information = "treedepth",chain = 0)
stan_diag(fitSimRegStan,information = "divergence",chain = 0)
Work with the shinystan object:
launch_shinystan(fitSimRegStan)
Recall that the data in this example contains the predictor height and the output weight for a group of people from Ht-Wt.csv (the data shown above).
Plot all heights observed in the sample and check the summary of the variable.
plot(1:length(dat$x),dat$x)
summary(dat$x)
Min. 1st Qu. Median Mean 3rd Qu. Max.
54.60 64.00 66.20 66.39 69.20 76.00
Can we predict the weight of a person who is 50 or 80 inches tall? Note that both values lie outside the observed range of heights (54.6 to 76 inches), so this is extrapolation. To do this we can go through all pairs of simulated parameters (\(\beta_0\), \(\beta_1\)) and use them to compute \(y(50)\) and \(y(80)\). This gives a distribution of predicted values.
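As a quick sanity check with the posterior means from the Stan fit (see the summary below), the point predictions are:
-139.35 + 4.45 * c(50, 80)   # approximately 83.2 and 216.7 lbs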
summary(fitSimRegStan)
$summary
mean se_mean sd 2.5% 25%
beta0 -139.354196 0.503202352 27.8915373 -194.808900 -157.605829
beta1 4.451593 0.007474065 0.4173043 3.636611 4.172879
nu 5.366643 0.027556563 1.6115181 3.151177 4.246276
sigma 23.982791 0.027670259 1.6554405 20.814300 22.838675
lp__ -1265.002519 0.024734258 1.4276951 -1268.552672 -1265.738311
50% 75% 97.5% n_eff Rhat
beta0 -139.512830 -120.737878 -85.121838 3072.271 1.000770
beta1 4.454489 4.723609 5.282330 3117.396 1.000699
nu 5.069790 6.157253 9.391004 3419.954 1.000994
sigma 23.957637 25.072876 27.313013 3579.322 1.000550
lp__ -1264.682923 -1263.938424 -1263.195323 3331.756 1.000423
$c_summary
, , chains = chain:1
stats
parameter mean sd 2.5% 25%
beta0 -140.496987 27.9128445 -195.452362 -159.028578
beta1 4.468322 0.4176474 3.643375 4.180768
nu 5.333266 1.6129391 3.146164 4.239934
sigma 23.996371 1.6756857 20.745408 22.833622
lp__ -1264.973734 1.3931068 -1268.262683 -1265.705478
stats
parameter 50% 75% 97.5%
beta0 -140.690732 -121.306242 -85.591653
beta1 4.473340 4.745248 5.301112
nu 5.036853 6.074138 9.402810
sigma 23.966000 25.101448 27.242245
lp__ -1264.658401 -1263.936454 -1263.201543
, , chains = chain:2
stats
parameter mean sd 2.5% 25%
beta0 -140.119919 27.7217798 -195.188356 -158.408294
beta1 4.462118 0.4146886 3.668582 4.177839
nu 5.396927 1.5988486 3.155521 4.277927
sigma 23.990640 1.6168350 21.056831 22.846187
lp__ -1264.946317 1.4072435 -1268.606024 -1265.657309
stats
parameter 50% 75% 97.5%
beta0 -140.000666 -121.163812 -86.601618
beta1 4.462406 4.731267 5.288368
nu 5.129432 6.181671 9.348566
sigma 23.918413 25.056950 27.345459
lp__ -1264.644540 -1263.891813 -1263.183147
, , chains = chain:3
stats
parameter mean sd 2.5% 25%
beta0 -137.758043 28.0473599 -193.976570 -156.709899
beta1 4.428479 0.4196264 3.620966 4.142890
nu 5.273822 1.5474425 3.130205 4.160287
sigma 23.905531 1.6471567 20.672028 22.808588
lp__ -1265.023664 1.3965268 -1268.555138 -1265.759590
stats
parameter 50% 75% 97.5%
beta0 -137.963462 -118.57774 -83.786093
beta1 4.430414 4.70807 5.278970
nu 4.999793 6.07770 9.087928
sigma 23.906472 24.98220 27.219559
lp__ -1264.712841 -1263.99760 -1263.196266
, , chains = chain:4
stats
parameter mean sd 2.5% 25%
beta0 -139.041834 27.8184750 -193.491107 -156.858089
beta1 4.447452 0.4163603 3.625069 4.186441
nu 5.462557 1.6789159 3.209953 4.299050
sigma 24.038623 1.6794892 20.805817 22.868034
lp__ -1265.066361 1.5085815 -1268.785015 -1265.858500
stats
parameter 50% 75% 97.5%
beta0 -139.288564 -121.639438 -83.861266
beta1 4.449144 4.707789 5.268918
nu 5.106526 6.256394 9.576797
sigma 24.014382 25.132032 27.449061
lp__ -1264.738186 -1263.933356 -1263.202000
regParam <- cbind(Beta0=rstan::extract(fitSimRegStan,pars="beta0")$'beta0',
                  Beta1=rstan::extract(fitSimRegStan,pars="beta1")$'beta1')
head(regParam)
Beta0 Beta1
[1,] -174.7224 4.962250
[2,] -118.6827 4.113129
[3,] -152.0820 4.701561
[4,] -147.0165 4.564773
[5,] -159.9549 4.751269
[6,] -126.6841 4.251516
predX50 <- apply(regParam, 1, function(z) z %*% c(1, 50))
predX80 <- apply(regParam, 1, function(z) z %*% c(1, 80))
Plot both distributions, look at their summaries and HDIs.
suppressWarnings(library(HDInterval))
den <- density(predX50)
plot(density(predX80), xlim=c(60,240))
lines(den$x, den$y)
summary(cbind(predX50,predX80))
predX50 predX80
Min. : 55.42 Min. :195.2
1st Qu.: 78.48 1st Qu.:212.9
Median : 83.19 Median :216.8
Mean : 83.23 Mean :216.8
3rd Qu.: 87.97 3rd Qu.:220.6
Max. :111.77 Max. :240.3
rbind(predX50=hdi(predX50),predX80=hdi(predX80))
lower upper
predX50 69.09173 97.20656
predX80 205.32725 228.15029
Both JAGS and Stan produced practically identical results.
plot(dta$height, dta$weight, ylab = "Weight (lbs)", xlab = "Height (inches)",
     main = "With Outliers")
abline(fit, col = "orange", lwd = 2)                          # OLS with influential points
abline(a = -139.96225, b = 4.46120, col = "blue", lwd = 1)    # Bayesian posterior means (JAGS)
abline(fit.outliers_removed, col = "red", lty = "dashed", lwd = 1)  # OLS, influential points removed
|           | With influential | Without influential | Bayesian approach w/ robust assumption |
|-----------|------------------|---------------------|----------------------------------------|
| Line      | orange, solid    | red, dashed         | blue, solid                            |
| Intercept | -104.78          | -155.91             | -139.96                                |
| Slope     | 3.98             | 4.71                | 4.46                                   |
Great! From the comparison we can see that fitting a simple linear regression with Bayesian methods and a robust (Student-t) likelihood remains stable when the traditional regression suffers from influential points, without having to delete any observations.