# ~/Week 4

*By Brandon Rozek, PhD student at RPI studying automated reasoning in AI.*

## Exponential Data

Suppose you’re waiting for a bus that you think comes on average once every 10 minutes, but you’re not sure exactly how often it comes. We can model the waiting time as $$Y \sim \text{Exp}(\lambda)$$ which gives the waiting time a prior expectation of $\frac{1}{\lambda}$

It turns out the gamma distribution is conjugate for an exponential likelihood. We need to specify a prior, in this case a particular gamma distribution. If we think that the buses come on average every ten minutes, that’s a rate of one over ten: $$\text{prior mean} = \frac{1}{10}$$ Thus, we’ll want to specify a gamma distribution whose first parameter divided by its second parameter is $\frac{1}{10}$

We can now think about our variability. Perhaps we specify $$\Gamma(100, 1000)$$ This will indeed have a prior mean of $\frac{1}{10}$, and it will have a standard deviation of $\frac{1}{100}$. A rough summary of our prior uncertainty is the mean plus or minus two standard deviations: $$0.1 \pm 0.02$$ Suppose that we wait for 12 minutes and a bus arrives. We now want to update our posterior for $\lambda$, the rate at which buses arrive. $$f(\lambda | y) \propto f(y|\lambda)f(\lambda)$$

$$f(\lambda | y) \propto \lambda e^{-\lambda y}\lambda^{\alpha - 1}e^{-\beta \lambda}$$

$$f(\lambda | y) \propto \lambda^{(\alpha + 1) - 1}e^{-(\beta + y)\lambda}$$

$$\lambda | y \sim \Gamma(\alpha + 1, \beta + y)$$
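As a quick sketch in R (plain arithmetic with the numbers above), the conjugate update and the resulting posterior summary look like this:

```r
# Conjugate gamma update for the bus example
a <- 100; b <- 1000            # prior Gamma(100, 1000): mean 0.1, sd 0.01
y <- 12                        # observed waiting time in minutes

a.post <- a + 1                # one exponential observation
b.post <- b + y                # posterior is Gamma(101, 1012)

a.post / b.post                        # posterior mean, about 0.0998
qgamma(c(.025, .975), a.post, b.post)  # 95% credible interval for lambda
```

Note that a single observation barely moves the prior: a Gamma(100, 1000) prior carries as much weight as 100 previous observations.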

## Linear Regression

### Single Variable Regression

We’ll be looking at the Challenger dataset. It contains 23 past shuttle launches, with the temperature on the day of launch and the O-ring damage index.

```r
oring = read.table("http://www.randomservices.org/random/data/Challenger2.txt",
                   header = TRUE)
# Note that attaching this masks T, which is originally TRUE
attach(oring)
```


Now we’ll plot the data:

```r
plot(T, I)
```

Fit a linear model:

```r
oring.lm = lm(I ~ T)
summary(oring.lm)
```


Output of the summary:

```
Call:
lm(formula = I ~ T)

Residuals:
    Min      1Q  Median      3Q     Max
-2.3025 -1.4507 -0.4928  0.7397  5.5337

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 18.36508    4.43859   4.138 0.000468 ***
T           -0.24337    0.06349  -3.833 0.000968 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.102 on 21 degrees of freedom
Multiple R-squared:  0.4116,    Adjusted R-squared:  0.3836
F-statistic: 14.69 on 1 and 21 DF,  p-value: 0.0009677
```


Add the fitted line to the scatterplot:

```r
lines(T, fitted(oring.lm))
```

Create a 95% posterior interval for the slope:

```r
-0.24337 - 0.06349*qt(.975, 21)
#  -0.3754047
-0.24337 + 0.06349*qt(.975, 21)
#  -0.1113353
```


Note: since this analysis uses the reference prior, these are the same as the frequentist confidence intervals.
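A quick way to check this (assuming `oring.lm` fitted as above) is R's built-in `confint`, which computes the classical confidence interval directly:

```r
# The posterior interval for the slope under the reference prior matches
# the frequentist confidence interval that confint() computes
confint(oring.lm, "T", level = 0.95)
# prints an interval of roughly -0.375 to -0.111, matching the
# hand computation with qt() above (up to rounding of the printed
# coefficient and standard error)
```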

If the challenger launch was at 31 degrees Fahrenheit, how much O-Ring damage would we predict?

```r
coef(oring.lm)[1] + coef(oring.lm)[2]*31
#  10.82052
```


Let’s make our posterior prediction interval:

```r
predict(oring.lm, data.frame(T=31), interval="predict")
```


Output of predict:

```
       fit      lwr      upr
1 10.82052 4.048269 17.59276
```


We can calculate the lower bound by hand with the following formula:

```r
10.82052 - 2.102*qt(.975, 21)*sqrt(1 + 1/23 + (31 - mean(T))^2/22/var(T))
```


What’s the posterior probability that the damage index is greater than zero?

```r
1 - pt((0 - 10.82052)/(2.102*sqrt(1 + 1/23 + (31 - mean(T))^2/22/var(T))), 21)
```


### Multivariate Regression

We’re looking at Galton’s seminal data on predicting the height of children from the heights of their parents.

```r
heights = read.table("http://www.randomservices.org/random/data/Galton.txt",
                     header = TRUE)
attach(heights)
```


What are the columns in the dataset?

```r
names(heights)
#  "Family" "Father" "Mother" "Gender" "Height" "Kids"
```


Let’s look at the relationships between the different variables:

```r
pairs(heights)
```

First, let’s create a linear model taking all of the columns into account:

```r
summary(lm(Height ~ Father + Mother + Gender + Kids))
```


Output of summary:

```
Call:
lm(formula = Height ~ Father + Mother + Gender + Kids)

Residuals:
    Min      1Q  Median      3Q     Max
-9.4748 -1.4500  0.0889  1.4716  9.1656

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 16.18771    2.79387   5.794 9.52e-09 ***
Father       0.39831    0.02957  13.472  < 2e-16 ***
Mother       0.32096    0.03126  10.269  < 2e-16 ***
GenderM      5.20995    0.14422  36.125  < 2e-16 ***
Kids        -0.04382    0.02718  -1.612    0.107
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.152 on 893 degrees of freedom
Multiple R-squared:  0.6407,    Adjusted R-squared:  0.6391
F-statistic: 398.1 on 4 and 893 DF,  p-value: < 2.2e-16
```


As you can see, the Kids coefficient is not significant (p ≈ 0.107). Let’s look at a model with it removed.

```r
summary(lm(Height ~ Father + Mother + Gender))
```


Output of summary:

```
Call:
lm(formula = Height ~ Father + Mother + Gender)

Residuals:
   Min     1Q Median     3Q    Max
-9.523 -1.440  0.117  1.473  9.114

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 15.34476    2.74696   5.586 3.08e-08 ***
Father       0.40598    0.02921  13.900  < 2e-16 ***
Mother       0.32150    0.03128  10.277  < 2e-16 ***
GenderM      5.22595    0.14401  36.289  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.154 on 894 degrees of freedom
Multiple R-squared:  0.6397,    Adjusted R-squared:  0.6385
F-statistic:   529 on 3 and 894 DF,  p-value: < 2.2e-16
```


This model looks good, so let’s save it to a variable:

```r
heights.lm = lm(Height ~ Father + Mother + Gender)
```


From this we can tell that each extra inch of the father’s height is associated with an extra 0.4 inches in the height of the child.

We can also tell that each extra inch of the mother’s height is associated with an extra 0.3 inches in the height of the child.

A male child is on average 5.2 inches taller than a female child.

Let’s create a 95% posterior interval for the difference in height by gender:

```r
5.226 - 0.144*qt(.975, 894)
#  4.943383
5.226 + 0.144*qt(.975, 894)
#  5.508617
```


Let’s make posterior prediction intervals for a male and a female child whose father is 68 inches tall and whose mother is 64 inches tall.

```r
predict(heights.lm, data.frame(Father=68, Mother=64, Gender="M"),
        interval="predict")
#        fit      lwr     upr
# 1 68.75291 64.51971 72.9861

predict(heights.lm, data.frame(Father=68, Mother=64, Gender="F"),
        interval="predict")
#        fit      lwr      upr
# 1 63.52695 59.29329 67.76062
```


## What’s next?

This concludes the course. If you want to go further with Bayesian statistics, the next topics to explore are hierarchical modeling and fitting non-conjugate models with Markov chain Monte Carlo (MCMC).
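As a small taste of what MCMC looks like (an illustrative sketch, not part of the course material): a random-walk Metropolis sampler for the bus-waiting rate $\lambda$ from earlier. The same loop works even when the prior is not conjugate, since it only needs the posterior up to a constant.

```r
# Random-walk Metropolis sketch: sample lambda from a posterior
# proportional to likelihood * prior; no conjugacy required
set.seed(42)
y <- 12                                   # observed waiting time
log.post <- function(l) {                 # log posterior, up to a constant
  if (l <= 0) return(-Inf)
  dexp(y, rate = l, log = TRUE) + dgamma(l, 100, 1000, log = TRUE)
}
n <- 10000
lambda <- numeric(n)
lambda[1] <- 0.1                          # start at the prior mean
for (i in 2:n) {
  proposal  <- lambda[i-1] + rnorm(1, 0, 0.01)   # random-walk proposal
  log.alpha <- log.post(proposal) - log.post(lambda[i-1])
  lambda[i] <- if (log(runif(1)) < log.alpha) proposal else lambda[i-1]
}
mean(lambda)   # close to the conjugate answer 101/1012
```

Here the conjugate Gamma(101, 1012) result gives us a known answer to check the sampler against; for genuinely non-conjugate models no such closed form exists, and the chain is all you have.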