library(tidyverse)
## ── Attaching packages ──────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1 ✔ purrr 0.3.3
## ✔ tibble 2.1.3 ✔ dplyr 0.8.3
## ✔ tidyr 0.8.1 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ─────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
Remember to set stringsAsFactors = FALSE when you load data. You can also import data via the menu: Files -> Import Dataset -> From Text (base).
barium <- read.delim("../Lab12/barium.txt", comment.char="#", stringsAsFactors=FALSE)
names(barium) <- c("temp", "pressure")
Specify the response variable as \(\log(\text{pressure})\) and the predictor variable as \(\frac{1}{\text{temperature}}\). You can use either base R or dplyr.
# dplyr
barium <- barium %>%
  mutate(lnpressure = log(pressure)) %>%
  mutate(invtemp = temp^(-1))
# baseR
# barium$lnpressure <- log(barium$pressure)
# barium$invtemp <- (barium$temp)^(-1)
Now plot the second-law relationship, \(\log(\text{pressure})\) versus \(\frac{1}{\text{temperature}}\), and see whether there is indeed a linear relationship.
ggplot(barium, aes(x = invtemp, y = lnpressure)) +
  geom_point() +
  geom_smooth(method = "lm") +
  xlab("inverted temperature") +
  ylab("ln(pressure)") +
  ggtitle("second law of thermodynamics") +
  theme_classic()
summary(barium[, c("lnpressure", "invtemp")])
## lnpressure invtemp
## Min. :-10.766 Min. :0.0008696
## 1st Qu.: -7.107 1st Qu.:0.0009238
## Median : -2.852 Median :0.0009888
## Mean : -4.174 Mean :0.0010517
## 3rd Qu.: -1.392 3rd Qu.:0.0011838
## Max. : -0.323 Max. :0.0013550
linear_model <- lm(lnpressure ~ invtemp, data = barium)
result <- summary(linear_model)
result
##
## Call:
## lm(formula = lnpressure ~ invtemp, data = barium)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.158596 -0.087968 0.004495 0.061145 0.293036
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.818e+01 1.419e-01 128.2 <2e-16 ***
## invtemp -2.126e+04 1.335e+02 -159.3 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1164 on 30 degrees of freedom
## Multiple R-squared: 0.9988, Adjusted R-squared: 0.9988
## F-statistic: 2.538e+04 on 1 and 30 DF, p-value: < 2.2e-16
result$call
## lm(formula = lnpressure ~ invtemp, data = barium)
The call shows the formula used to fit the model: the response variable is lnpressure, the predictor is invtemp, and the data used to fit the model is barium.
summary(result$residuals)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.158596 -0.087968 0.004495 0.000000 0.061145 0.293036
The residual is \(\hat{e}_i = y_i - \hat{y}_i\), which is an estimate of the error \(e_i\). Here we have the summary statistics of the residuals: the minimum, first quartile, median, mean, third quartile, and maximum.
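As a quick sanity check, the stored residuals should match \(y_i - \hat{y}_i\) computed by hand; a minimal sketch using the linear_model object fitted above:
manual_resid <- barium$lnpressure - fitted(linear_model)  # y_i - y_hat_i
all.equal(unname(result$residuals), unname(manual_resid)) # should be TRUE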
coef(result)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.1847 0.1418549 128.1922 1.172782e-42
## invtemp -21259.6919 133.4539677 -159.3036 1.747268e-45
result$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.1847 0.1418549 128.1922 1.172782e-42
## invtemp -21259.6919 133.4539677 -159.3036 1.747268e-45
class(coef(result))
## [1] "matrix"
class(result$coefficients)
## [1] "matrix"
The coefficients section shows the estimated values of the coefficients in the linear model, together with their standard errors, t values, and p-values.
The t value tests whether the corresponding coefficient is zero. If the p-value is smaller than 0.05, we can reject the null hypothesis that the coefficient is zero at the 5% level.
\[\frac{\hat{\beta}_i - \beta_i}{s_{\hat{\beta}_i}} = \frac{\hat{\beta}_i - \beta_i}{\sqrt{\frac{\text{RSS}/(n-(k+1))}{n\text{Var}(X_i)(1-R_i^2)}}} \sim t_{n-(k+1)}\]
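To see where these numbers come from, the t value and p-value for the invtemp coefficient can be reproduced by hand from the estimate and its standard error; a sketch using the coefficient table above:
est  <- coef(result)["invtemp", "Estimate"]
se   <- coef(result)["invtemp", "Std. Error"]
tval <- est / se                                         # t value
pval <- 2 * pt(abs(tval), df = linear_model$df.residual, # two-sided p-value
               lower.tail = FALSE)
c(t = tval, p = pval)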
result$sigma
## [1] 0.1163759
RSS <- sum( (result$residuals)^2 )
sqrt(RSS / (nrow(barium)-2) ) # sqrt(RSS/(n-2))
## [1] 0.1163759
round(result$sigma, 4) # this is the displayed residual standard error in the linear model output
## [1] 0.1164
The residual standard error is the square root of the residual sum of squares divided by the residual degrees of freedom, i.e., \(\sqrt{\frac{\text{RSS}}{n-2}}\). Notice that the residual degrees of freedom are \(n-2\) in simple linear regression and \(n-(k+1)\) in the multiple linear regression scenario. Thus, in the multiple linear regression case, we need to divide by \(n-(k+1)\), where \(k\) is the number of predictors in the linear model.
The residual standard error is used to measure how close the linear regression fitted values \(\hat{y}\) are to the data.
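For the general multiple-regression case, the same computation with \(k\) predictors looks like the sketch below; here \(k = 1\), so it reduces to the \(n-2\) version above:
k <- length(coef(linear_model)) - 1             # number of predictors
n <- nrow(barium)
sqrt(sum(result$residuals^2) / (n - (k + 1)))   # equals result$sigma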
result$r.squared
## [1] 0.9988193
TSS <- sum((barium$lnpressure - mean(barium$lnpressure))^2)
1 - RSS/TSS
## [1] 0.9988193
\[R^2 = 1 - \frac{\text{RSS}}{\text{TSS}}\] R squared, \(R^2\), is also called the coefficient of determination. It measures the proportion of the variance in the response variable that is explained by the predictor variables. \(R^2\) is between 0 and 1. The larger \(R^2\) is, the better the model fits the data.
However, as the number of predictor variables increases, \(R^2\) can only increase (it never decreases), even when the added predictors are uninformative. To correct for this bias, we introduce the adjusted R squared. Denoting the number of predictors as \(k\), \[\text{adjusted } R^2 = 1 - \frac{\text{RSS}/(n-k-1)}{\text{TSS}/(n-1)}\]
result$adj.r.squared
## [1] 0.9987799
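As a check, the adjusted R squared can be recomputed from the RSS and TSS calculated above (a sketch with \(k = 1\) predictor):
n <- nrow(barium)
k <- 1                                     # one predictor, invtemp
1 - (RSS / (n - k - 1)) / (TSS / (n - 1))  # matches result$adj.r.squared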
\[F_{k, n-(k+1)} = \frac{R^2/k}{(1-R^2)/(n-(k+1))} = \frac{R^2}{1-R^2}\cdot\frac{n-(k+1)}{k}\] The F statistic tests the null hypothesis that all predictor coefficients are zero against the alternative that at least one predictor has a nonzero coefficient.
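The F statistic reported in the summary can likewise be recovered from \(R^2\); a sketch (result$fstatistic stores the value along with its numerator and denominator degrees of freedom):
R2 <- result$r.squared
k  <- 1
n  <- nrow(barium)
Fstat <- (R2 / k) / ((1 - R2) / (n - (k + 1)))
c(by_hand = Fstat, stored = unname(result$fstatistic["value"])) # should agree
pf(Fstat, k, n - (k + 1), lower.tail = FALSE)                   # p-value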