library(tidyverse)
## ── Attaching packages ──────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1     ✔ purrr   0.3.3
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   0.8.1     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ── Conflicts ─────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Loading Data

Remember to set stringsAsFactors = FALSE when you load data. You can also load data interactively in RStudio via Files -> Import Dataset -> From Text (base).

barium <- read.delim("../Lab12/barium.txt", comment.char="#", stringsAsFactors=FALSE)
names(barium) <- c("temp", "pressure")

Specify the response variable as \(\log(\text{pressure})\) and the predictor variable as \(\frac{1}{\text{temperature}}\). You can use either base R or dplyr.

# dplyr 
barium <- barium %>%
  mutate(lnpressure = log(pressure)) %>%
  mutate(invtemp = temp^(-1))

# base R
# barium$lnpressure <- log(barium$pressure)
# barium$invtemp <- (barium$temp)^(-1)

Now plot the relationship predicted by the second law of thermodynamics, and see whether it is indeed linear.

ggplot(barium, aes(x = invtemp, y = lnpressure)) +
  geom_point() +
  geom_smooth(method = "lm") +
  xlab("inverted temperature") +
  ylab("ln(pressure)") +
  ggtitle("second law of thermodynamics") +
  theme_classic()

summary(barium[, c("lnpressure", "invtemp")])
##    lnpressure         invtemp         
##  Min.   :-10.766   Min.   :0.0008696  
##  1st Qu.: -7.107   1st Qu.:0.0009238  
##  Median : -2.852   Median :0.0009888  
##  Mean   : -4.174   Mean   :0.0010517  
##  3rd Qu.: -1.392   3rd Qu.:0.0011838  
##  Max.   : -0.323   Max.   :0.0013550
linear_model <- lm(lnpressure ~ invtemp, data = barium)
result <- summary(linear_model)
result
## 
## Call:
## lm(formula = lnpressure ~ invtemp, data = barium)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.158596 -0.087968  0.004495  0.061145  0.293036 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.818e+01  1.419e-01   128.2   <2e-16 ***
## invtemp     -2.126e+04  1.335e+02  -159.3   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1164 on 30 degrees of freedom
## Multiple R-squared:  0.9988, Adjusted R-squared:  0.9988 
## F-statistic: 2.538e+04 on 1 and 30 DF,  p-value: < 2.2e-16

1. Formula Call

result$call
## lm(formula = lnpressure ~ invtemp, data = barium)

The call shows the formula used to fit the model: the response variable is lnpressure, the predictor is invtemp, and the data set used to fit the model is barium.

2. Residuals

summary(result$residuals)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.158596 -0.087968  0.004495  0.000000  0.061145  0.293036

The residual is \(\hat{e}_i = y_i - \hat{y}_i\), which is an estimate of the error \(e_i\). Here we have the summary statistics of the residuals: the minimum, first quartile, median, third quartile, and maximum values.
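To see where these numbers come from, you can recompute the residuals by hand as \(y_i - \hat{y}_i\). A minimal sketch, using the built-in cars data set so it runs even without barium.txt (the same commands work on linear_model):

```r
# Fit a simple linear model on a built-in data set
fit <- lm(dist ~ speed, data = cars)

# Residuals by hand: observed response minus fitted values
resid_by_hand <- cars$dist - fitted(fit)
all.equal(unname(residuals(fit)), unname(resid_by_hand))  # TRUE

# With an intercept in the model, OLS residuals always average to zero,
# which is why the Mean shown in the summary above is exactly 0
mean(residuals(fit))  # effectively zero, up to floating-point error
```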

3. Coefficients

coef(result)
##                Estimate  Std. Error   t value     Pr(>|t|)
## (Intercept)     18.1847   0.1418549  128.1922 1.172782e-42
## invtemp     -21259.6919 133.4539677 -159.3036 1.747268e-45
result$coefficients
##                Estimate  Std. Error   t value     Pr(>|t|)
## (Intercept)     18.1847   0.1418549  128.1922 1.172782e-42
## invtemp     -21259.6919 133.4539677 -159.3036 1.747268e-45
class(coef(result))
## [1] "matrix"
class(result$coefficients)
## [1] "matrix"

The coefficients section shows the estimated values of the coefficients in the linear model, as well as their standard error, t-value, and p-value.

The t-value can be used to test whether the corresponding coefficient is zero. If the p-value is smaller than 0.05, we reject the null hypothesis that the coefficient is zero at the 5% significance level.

\[\frac{\hat{\beta}_i - \beta_i}{s_{\hat{\beta}_i}} = \frac{\hat{\beta}_i - \beta_i}{\sqrt{\frac{\text{RSS}/(n-(k+1))}{n\text{Var}(X_i)(1-R_i^2)}}} \sim t_{n-(k+1)}\]
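As a sanity check, the t-values and p-values in the coefficient table can be recomputed directly from the estimates and standard errors. A sketch using the built-in cars data set (the same steps apply to result):

```r
fit <- lm(dist ~ speed, data = cars)
s   <- summary(fit)

est  <- s$coefficients[, "Estimate"]
se   <- s$coefficients[, "Std. Error"]
tval <- est / se  # t value = estimate / standard error

# Two-sided p-value from the t distribution with n - (k + 1) df
pval <- 2 * pt(abs(tval), df = fit$df.residual, lower.tail = FALSE)

all.equal(tval, s$coefficients[, "t value"])   # TRUE
all.equal(pval, s$coefficients[, "Pr(>|t|)"])  # TRUE
```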

4. Residual standard error

result$sigma
## [1] 0.1163759
RSS <- sum( (result$residuals)^2 )
sqrt(RSS / (nrow(barium)-2) ) # RSS/(n-2)
## [1] 0.1163759
round(result$sigma, 4) # this is the displayed residual standard error in the linear model output
## [1] 0.1164

The residual standard error is the square root of the residual sum of squares divided by the residual degrees of freedom, i.e., \(\sqrt{\frac{\text{RSS}}{n-2}}\). Notice that the residual degrees of freedom is \(n-2\) in simple linear regression and \(n-(k+1)\) in multiple linear regression. Thus, in the multiple linear regression case, we divide by \(n-(k+1)\), where \(k\) is the number of predictors in the linear model.

The residual standard error measures how close the linear regression fitted values \(\hat{y}\) are to the observed data.

5. R Squared

result$r.squared
## [1] 0.9988193
TSS <- sum((barium$lnpressure - mean(barium$lnpressure))^2)
1 - RSS/TSS
## [1] 0.9988193

\[R^2 = 1 - \frac{\text{RSS}}{\text{TSS}}\] R Squared \(R^2\) is also called the coefficient of determination. It measures the proportion of the variance in the response variable that is explained by the predictor variables. \(R^2\) is between 0 and 1. The larger \(R^2\) is, the better the model fits to the data.

However, as the number of predictor variables increases, \(R^2\) can only increase: adding a predictor never decreases it, even when the new predictor is uninformative. To correct for this, we introduce the adjusted R squared. Denote the number of predictors as \(k\), \[\text{adjusted } R^2 = 1 - \frac{\text{RSS}/(n-k-1)}{\text{TSS}/(n-1)}\]

result$adj.r.squared
## [1] 0.9987799
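You can verify the adjusted \(R^2\) from the formula above by hand; a sketch using the built-in cars data set (the same steps apply to the barium model):

```r
fit <- lm(dist ~ speed, data = cars)
s   <- summary(fit)

RSS <- sum(residuals(fit)^2)
TSS <- sum((cars$dist - mean(cars$dist))^2)
n   <- nrow(cars)
k   <- 1  # one predictor (speed)

# adjusted R^2 = 1 - (RSS / (n - k - 1)) / (TSS / (n - 1))
adj_r2 <- 1 - (RSS / (n - k - 1)) / (TSS / (n - 1))
all.equal(adj_r2, s$adj.r.squared)  # TRUE
```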

6. F statistic

\[F_{k, n-(k+1)} = \frac{R^2/k}{(1-R^2)/(n-(k+1))} = \frac{R^2}{1-R^2}\cdot\frac{n-(k+1)}{k}\] The F statistic can be used to test whether there is at least one predictor variable whose coefficient is nonzero.
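The F statistic reported by summary() can likewise be reproduced from \(R^2\) using the formula above; a sketch on the built-in cars data set:

```r
fit <- lm(dist ~ speed, data = cars)
s   <- summary(fit)

r2 <- s$r.squared
n  <- nrow(cars)
k  <- 1  # one predictor

# F = (R^2 / k) / ((1 - R^2) / (n - (k + 1)))
Fstat <- (r2 / k) / ((1 - r2) / (n - k - 1))
all.equal(Fstat, unname(s$fstatistic["value"]))  # TRUE
```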