library(tidyverse)
## ── Attaching packages ──────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1 ✔ purrr 0.3.3
## ✔ tibble 2.1.3 ✔ dplyr 0.8.3
## ✔ tidyr 0.8.1 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ─────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
Remember to set stringsAsFactors = FALSE when you load data. You can also import data in RStudio via File -> Import Dataset -> From Text (base).
barium <- read.delim("../Lab12/barium.txt", comment.char="#", stringsAsFactors=FALSE)
names(barium) <- c("temp", "pressure")
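As a quick sanity check, you can confirm that both columns were read in as numeric rather than as factors (the exact output depends on your copy of barium.txt):
str(barium)  # temp and pressure should both be numeric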
Specify the response variable as \(\log(\text{pressure})\) and the predictor variable as \(\frac{1}{\text{temperature}}\). You can use either base R or dplyr.
# dplyr
barium <- barium %>%
  mutate(lnpressure = log(pressure)) %>%  # response: log of pressure
  mutate(invtemp = temp^(-1))             # predictor: inverse temperature
# base R
# barium$lnpressure <- log(barium$pressure)
# barium$invtemp <- (barium$temp)^(-1)
Now plot \(\log(\text{pressure})\) against \(\frac{1}{\text{temperature}}\), the relationship implied by the second law of thermodynamics, and see whether it is indeed linear.
ggplot(barium, aes(x = invtemp, y = lnpressure)) +
  geom_point() +
  geom_smooth(method = "lm") +
  xlab("inverted temperature") +
  ylab("ln(pressure)") +
  ggtitle("second law of thermodynamics") +
  theme_classic()
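For reference, here is a minimal base-graphics sketch of the same plot; it refits the line inline with lm(), since linear_model is only defined further below.
# base R version of the plot above
plot(lnpressure ~ invtemp, data = barium,
     xlab = "inverted temperature", ylab = "ln(pressure)",
     main = "second law of thermodynamics")
abline(lm(lnpressure ~ invtemp, data = barium))  # overlay the fitted line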
summary(barium[, c("lnpressure", "invtemp")])
## lnpressure invtemp
## Min. :-10.766 Min. :0.0008696
## 1st Qu.: -7.107 1st Qu.:0.0009238
## Median : -2.852 Median :0.0009888
## Mean : -4.174 Mean :0.0010517
## 3rd Qu.: -1.392 3rd Qu.:0.0011838
## Max. : -0.323 Max. :0.0013550
linear_model <- lm(lnpressure ~ invtemp, data = barium)
result <- summary(linear_model)
result
##
## Call:
## lm(formula = lnpressure ~ invtemp, data = barium)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.158596 -0.087968 0.004495 0.061145 0.293036
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.818e+01 1.419e-01 128.2 <2e-16 ***
## invtemp -2.126e+04 1.335e+02 -159.3 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1164 on 30 degrees of freedom
## Multiple R-squared: 0.9988, Adjusted R-squared: 0.9988
## F-statistic: 2.538e+04 on 1 and 30 DF, p-value: < 2.2e-16
result$call
## lm(formula = lnpressure ~ invtemp, data = barium)
The call shows the formula used to fit the model: the response variable is lnpressure, the predictor is invtemp, and the data used to fit the model is barium.
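If you only need the model formula itself rather than the full call, it can be extracted directly:
formula(linear_model)
## lnpressure ~ invtemp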
summary(result$residuals)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.158596 -0.087968 0.004495 0.000000 0.061145 0.293036
The residual is \(\hat{e}_i = y_i - \hat{y}_i\), which is an estimator of the error \(e_i\). Here we have the summary statistics of the residuals: the minimum, first quartile, median, mean, third quartile, and maximum values.
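As a quick check of this definition, the stored residuals match the observed response minus the fitted values:
# hat(e)_i = y_i - hat(y)_i, recomputed by hand
all.equal(unname(result$residuals),
          unname(barium$lnpressure - fitted(linear_model)))
## [1] TRUE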
coef(result)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.1847 0.1418549 128.1922 1.172782e-42
## invtemp -21259.6919 133.4539677 -159.3036 1.747268e-45
result$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.1847 0.1418549 128.1922 1.172782e-42
## invtemp -21259.6919 133.4539677 -159.3036 1.747268e-45
class(coef(result))
## [1] "matrix"
class(result$coefficients)
## [1] "matrix"
The coefficients section shows the estimated values of the coefficients in the linear model, along with their standard errors, t-values, and p-values.
The t-value tests whether the corresponding coefficient is zero: under \(H_0: \beta_i = 0\), the statistic below follows a \(t\) distribution with \(n-(k+1)\) degrees of freedom. If the p-value is smaller than 0.05, we reject the null hypothesis at the 5% level.
\[\frac{\hat{\beta}_i - \beta_i}{s_{\hat{\beta}_i}} = \frac{\hat{\beta}_i - \beta_i}{\sqrt{\frac{\text{RSS}/(n-(k+1))}{n\text{Var}(X_i)(1-R_i^2)}}} \sim t_{n-(k+1)}\]
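For example, the t-value and p-value of invtemp in the table above can be reproduced from the estimate, its standard error, and the residual degrees of freedom (30 here); est, se, and tval are just scratch variables:
est <- coef(result)["invtemp", "Estimate"]
se <- coef(result)["invtemp", "Std. Error"]
(tval <- est / se)  # matches the t value in the coefficients table
## [1] -159.3036
2 * pt(abs(tval), df = 30, lower.tail = FALSE)  # two-sided p-value
## [1] 1.747268e-45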
result$sigma
## [1] 0.1163759
RSS <- sum( (result$residuals)^2 )
sqrt(RSS / (nrow(barium) - 2))  # sqrt(RSS/(n-2))
## [1] 0.1163759
round(result$sigma, 4) # this is the displayed residual standard error in the linear model output
## [1] 0.1164
The residual standard error is the square root of the residual sum of squares divided by the residual degrees of freedom, i.e., \(\sqrt{\frac{\text{RSS}}{n-2}}\). Notice that the residual degrees of freedom is \(n-2\) in simple linear regression, but \(n-(k+1)\) in the multiple linear regression scenario. Thus, in the multiple linear regression case, we need to divide by \(n-(k+1)\), where \(k\) is the number of predictors in the linear model.
The residual standard error is used to measure how close the linear regression fitted values \(\hat{y}\) are to the data.
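The residual degrees of freedom used in this computation can also be read directly off the fitted model:
df.residual(linear_model)  # n - 2 in simple linear regression
## [1] 30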
result$r.squared
## [1] 0.9988193
TSS <- sum((barium$lnpressure - mean(barium$lnpressure))^2)
1 - RSS/TSS
## [1] 0.9988193
\[R^2 = 1 - \frac{\text{RSS}}{\text{TSS}}\] R squared, \(R^2\), is also called the coefficient of determination. It measures the proportion of the variance in the response variable that is explained by the predictor variables. \(R^2\) is between 0 and 1; the larger \(R^2\) is, the better the model fits the data.
However, \(R^2\) can never decrease as predictor variables are added to the model, even when the new predictors are uninformative. To correct for this, we introduce the adjusted R squared. Denote the number of predictors as \(k\), \[\text{adjusted } R^2 = 1 - \frac{\text{RSS}/(n-k-1)}{\text{TSS}/(n-1)}\]
result$adj.r.squared
## [1] 0.9987799
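As a check, the adjusted \(R^2\) can be reproduced from the RSS and TSS computed above; n and k below are scratch variables for the sample size and the number of predictors:
n <- nrow(barium)  # 32 observations (residual df 30 plus 2 estimated coefficients)
k <- 1             # one predictor, invtemp
1 - (RSS / (n - k - 1)) / (TSS / (n - 1))
## [1] 0.9987799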
\[F_{k, n-(k+1)} = \frac{R^2/k}{(1-R^2)/(n-(k+1))} = \frac{R^2}{1-R^2}\cdot\frac{n-(k+1)}{k}\] The F statistic can be used to test whether there is at least one predictor variable whose coefficient is nonzero.
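The F statistic and its p-value can likewise be recomputed from \(R^2\); summary() also stores them in result$fstatistic.
R2 <- result$r.squared
Fstat <- (R2 / k) / ((1 - R2) / (n - k - 1))  # should match the 2.538e+04 above
pf(Fstat, df1 = k, df2 = n - k - 1, lower.tail = FALSE)  # overall F-test p-value
result$fstatistic  # value, numdf, dendf as stored by summary()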