class: center, middle, inverse, title-slide

.title[
# MATH 204 Introduction to Statistics
]
.subtitle[
## Lecture 14: Intro to Regression
]
.author[
### JMG
]

---

## Goals for Lecture

In this lecture, we

- introduce simple linear regression via least squares,

--

- consider statistical inference for the regression parameters, and

--

- explain how to fit linear regression using R.

---

## Linear Regression: Introduction

- Linear regression is a statistical method for fitting a line to data.

--

- Recall that a (non-vertical) line in the `\(x,y\)`-plane is determined by `$$y = \text{slope} \times x + \text{intercept}$$`

--

- There are two aspects to fitting a line to data that we will study:

--

- Estimating the slope and intercept values, and

--

- Assessing the uncertainty of our estimates for the slope and intercept values.

--

- In this lecture, we cover all of the concepts necessary to understand how to carry out and interpret linear regression.

--

- We encourage you to watch the video on the next slide as an introduction to linear regression.

---

## Regression Intro Video

---

## Learning Objectives

- After this lecture, you should

--

- Understand the basic principles of simple linear regression: parameter estimates, residuals, and correlation. (8.1, 8.2)

--

- Know the conditions for least squares regression: linearity, normality, constant variance, and independence. (8.2)

--

- Know how to diagnose problems with a linear fit by least squares regression. (8.3)

--

- Understand the methods of inference for least squares regression. (8.4)

--

- Know how to obtain a linear fit using R with the `lm` function.

--

- Be able to assess and interpret the results of a linear fit using R with the `lm` function.

---

## Simple Regression Model

- Simple linear regression models the relationship between two (numerical) variables `\(x\)` and `\(y\)` by the formula `$$y = \beta_{0} + \beta_{1} x + \epsilon$$` where

--

- `\(\beta_{0}\)` (intercept) and `\(\beta_{1}\)` (slope) are the model **parameters**

--

- `\(\epsilon\)` is the **error**

--

- The parameters are estimated using data, and their point estimates are denoted by `\(b_{0}\)` (intercept estimate) and `\(b_{1}\)` (slope estimate).

--

- In linear regression, `\(x\)` is called the explanatory or predictor variable while `\(y\)` is called the response variable.

--

- Let's look at an example data set for which linear regression is a potentially useful model.

---

## Australian Brushtail Possum

.pull-left[
<img src="https://www.dropbox.com/s/5hxo31udpanhnle/opossum.jpg?raw=1" width="70%" style="display: block; margin: auto;" />
]

.pull-right[
- The `possum` data set records measurements of 104 brushtail possums from Australia and New Guinea; the first few rows of the data are shown below
]

```
## # A tibble: 4 × 8
##    site pop   sex     age head_l skull_w total_l tail_l
##   <int> <fct> <fct> <int>  <dbl>   <dbl>   <dbl>  <dbl>
## 1     1 Vic   m         8   94.1    60.4    89     36
## 2     1 Vic   f         6   92.5    57.6    91.5   36.5
## 3     1 Vic   f         6   94      60      95.5   39
## 4     1 Vic   f         6   93.2    57.1    92     38
```

---

## Possum Data Example

- Suppose that we as researchers are interested in studying the relationship between the head length (`head_l`) and total length (`total_l`) measurements of the brushtail possum of Australia.

--

- Note that head length (`head_l`) and total length (`total_l`) are both (continuous) numerical variables.

--

- The next slide displays a scatterplot for `head_l` versus `total_l`.

---

## Possum Data Scatterplot

<img src="index_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />

--

- Describe the features of any association that appears to exist between the two variables in the plot.

---

## Possum Data Regression Line

- We have added the "best fit" line to the scatter plot of `head_l` versus `total_l`. Later we discuss how this line is obtained.

<img src="index_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />

--

- A **residual** is the vertical distance between a data point and the best fit line. The next slide shows all of the residuals for the data.

---

## Possum Regression Residuals

<img src="index_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />

--

- The best fit or regression line is the line that makes the residuals collectively as small as possible; we make this precise when we discuss least squares.

---

## Possum Regression Residual Plot

- A residual plot displays the residual values versus the `\(x\)` values from the data.

<img src="index_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />

--

- As we will see, residual plots play an important role in assessing the results of a regression model.
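
---

## Reproducing the Residual Plot (Sketch)

- For reference, here is a minimal sketch of how a fitted line and a residual plot like the ones above could be produced in R. This is a sketch rather than the exact code used to build these slides; it assumes the `possum` data frame comes from the `openintro` package and uses the `lm` function, which we introduce later in this lecture.

```r
library(openintro)   # assumed source of the possum data
library(ggplot2)

# Fit the least squares line for head length versus total length
fit <- lm(head_l ~ total_l, data = possum)

# Residual = observed head_l minus the value predicted by the line
possum$residual <- resid(fit)

# Scatterplot with the least squares line added
ggplot(possum, aes(x = total_l, y = head_l)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# Residual plot: residuals versus the predictor total_l
ggplot(possum, aes(x = total_l, y = residual)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed")
```
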

---

## Fits and Residual Plots

- Let's look at some linear fits and their corresponding residual plots.

--

.center[
<img src="https://www.dropbox.com/s/1q360mf7pgyn7t7/sampleLinesAndResPlots.png?raw=1" width="70%" style="display: block; margin: auto;" />
]

--

- In the first column, the residuals show no obvious pattern, which is desirable. In the second column, the residuals show a pattern that suggests a linear model is inappropriate. In the third column, it's not clear if the linear fit is statistically significant.

---

## Correlation

- **Correlation**, which always takes values between -1 and 1, is a statistic that describes the strength of the linear relationship between two variables. Correlation is denoted by `\(R\)`.

--

- In R, correlation is computed with the `cor` command. For example, the correlation between the `head_l` and `total_l` variables in the `possum` data set is computed as

.panelset[
.panel[.panel-name[Results]

```
## [1] 0.6910937
```

]
.panel[.panel-name[Code]

```r
cor(possum$head_l,possum$total_l)
```

]
]

--

- The plot in the next slide shows several scatter plots together with the corresponding correlation value.

---

## Correlation Illustrations

.center[
<img src="https://www.dropbox.com/s/jbgilwujdebjwwf/posNegCorPlots.png?raw=1" width="80%" style="display: block; margin: auto;" />
]

---

## Strongly Related Variables with Weak Correlations

- It is important to note that two variables may have a strong association even if their correlation is relatively weak. This is because correlation measures **linear** association, and variables may have a strong **nonlinear** association.

--

.center[
<img src="https://www.dropbox.com/s/rmujuxuoihdg5xb/corForNonLinearPlots.png?raw=1" width="100%" style="display: block; margin: auto;" />
]

---

## Least Squares Regression

- We now begin to discuss the details of how to fit a simple linear regression model to data.

--

- The approach we take is called *least squares regression*.

--

- The idea is to choose parameter estimates that minimize the sum of the squared residuals. That is, for each observed data point `\((x_{i},y_{i})\)`, we find `\(b_{0}\)` and `\(b_{1}\)` such that if `\(\hat{y}_{i} = b_{0} + b_{1}x_{i}\)`, then `$$RSS = \sum_{i=1}^{n}(\hat{y}_{i} - y_{i})^2$$` is as small as possible.

--

- For this to work out well, several conditions need to be met. These conditions are spelled out on the next slide.

---

## Conditions for Least Squares

- **Linearity.** The data should show a linear trend.

--

- **Normality.** Generally, the distribution of the residuals should be close to normal.

--

- **Constant Variance.** The variability of points around the least squares line remains roughly constant. Residual plots are a good way to check this condition.

--

- **Independence.** We want to avoid fitting a line to data via least squares whenever there is dependence between consecutive data points.

--

- The next slide shows plots of data where at least one of the conditions for least squares regression fails to hold.

---

## Regression Assumption Failures

.center[
<img src="https://www.dropbox.com/s/hu01jaaqx2ohbgq/whatCanGoWrongWithLinearModel.png?raw=1" width="100%" style="display: block; margin: auto;" />
]

--

- In the first column, the linearity condition fails. In the second column, the normality condition fails.

--

- In the third column, the constant variance condition fails. In the fourth column, the independence condition fails.

--

- Notice how in each case, the residual plot can be used to diagnose problems with a least squares linear regression fit.

---

## Fitting a Linear Model

- For simple least squares linear regression with one predictor variable ( `\(x\)` ), one can fit the model to data "by hand". The mathematical formulas are `$$b_{1} = \frac{s_{y}}{s_{x}} R,$$` `$$b_{0} = \bar{y} - b_{1} \bar{x},$$` where

- `\(R\)` is the correlation between `\(x\)` and `\(y\)`,

--

- `\(s_{y}\)` and `\(s_{x}\)` are the sample standard deviations for `\(y\)` and `\(x\)`, and

--

- `\(\bar{y}\)` and `\(\bar{x}\)` are the sample means for `\(y\)` and `\(x\)`.

--

- Let's apply these formulas to the `possum` data set with `\(x\)` the `total_l` variable and `\(y\)` the `head_l` variable.

---

## Applying the Regression Formulas

- We need to compute the correlation, sample means, and sample standard deviations:

.panelset[
.panel[.panel-name[Results]

]
.panel[.panel-name[Code]

```r
x <- possum$total_l; y <- possum$head_l
x_bar <- mean(x); y_bar <- mean(y)
s_x <- sd(x); s_y <- sd(y)
R <- cor(x,y)
```

]
]

--

- Now we can compute our estimates `\(b_{0}\)` and `\(b_{1}\)`:

.panelset[
.panel[.panel-name[Results]

```
## [1] 0.5729013
```

```
## [1] 42.70979
```

]
.panel[.panel-name[Code]

```r
(b_1 <- (s_y/s_x)*R)
(b_0 <- y_bar - b_1*x_bar)
```

]
]

--

- There is an R command, `lm` (linear model), that will compute these values and much more for us.

---

## The lm Command

- Let's see an example of how the `lm` command is used:

.panelset[
.panel[.panel-name[Results]

```
## 
## Call:
## lm(formula = head_l ~ total_l, data = possum)
## 
## Coefficients:
## (Intercept)      total_l  
##     42.7098       0.5729
```

]
.panel[.panel-name[Code]

```r
lm(head_l ~ total_l, data=possum)
```

]
]

--

- Notice that this returns the point estimate values for `\(b_{0}\)` (Intercept) and the slope `\(b_{1}\)`, and that these values are the same as what we obtained using the mathematical formulas on the last slide.

---

## Interpreting Model Parameters

- For a linear model,

--

- The slope describes the estimated difference in the `\(y\)` variable if the explanatory variable `\(x\)` for a case happened to be one unit larger.

--

- The intercept describes the average outcome of `\(y\)` if `\(x=0\)` **and** the linear model is valid all the way to `\(x=0\)`, which in many applications is not the case.

--

- To evaluate the strength of a linear fit, we compute `\(R^{2}\)` (R-squared). The value of `\(R^{2}\)` tells us the percent of variation in the response that is explained by the explanatory variable.

--

- There are some pitfalls in interpreting the results of a linear model. In particular,

--

- Applying a model to estimate values outside of the realm of the original data is called **extrapolation**. Generally, extrapolation is unreliable.

--

- In many cases, even when there is a real association between variables, we cannot infer a causal connection between the variables.

---

## Example

- Before we discuss further details of regression, let's look at a detailed example of fitting and interpreting a linear model.

--

- We will look at a linear model for the `cheddar` data set from the `faraway` package.

--

- Let's do this example together in RStudio.

---

## Outlier Issues

- Outliers in regression are observations that fall far from the cloud of points.

--

- Outliers can have a strong influence on the least squares line.

--

- Points that fall horizontally away from the center of the cloud tend to pull harder on the line, so we call them points with **high leverage**.

--

- A data point is called an **influential point** if, had we fitted the line without it, the influential point would have been unusually far from the least squares line.

--

- The next slide shows data with outliers together with the corresponding regression line and residual plot.

---

## Regression Outliers

.center[
<img src="https://www.dropbox.com/s/7t2qjdtwz7rkuok/outlierPlots.png?raw=1" width="40%" style="display: block; margin: auto;" />
]

---

## Inference for Regression

- Recall that a simple linear model has the form `$$y = \beta_{0} + \beta_{1} x + \epsilon$$`

--

- Least squares is a method for obtaining point estimates `\(b_{0}\)` and `\(b_{1}\)` for the parameters `\(\beta_{0}\)` and `\(\beta_{1}\)`. Thus, `\(\beta_{0}\)` and `\(\beta_{1}\)` are unknown population values that we want to make inferences about.

--

- A somewhat subtle point is that we also do not know the population standard deviation `\(\sigma\)` for the error `\(\epsilon\)`. This is an additional model parameter.

--

- We would like answers to the following questions:

--

- How do we obtain confidence intervals for `\(\beta_{0}\)` and `\(\beta_{1}\)`; specifically, how do we get the standard error?

--

- How do we conduct hypothesis tests related to the parameters `\(\beta_{0}\)` and `\(\beta_{1}\)`?

---

## Confidence Intervals for Model Coefficients

- **Fact:** The sampling distribution of the estimates of `\(\beta_{0}\)` and `\(\beta_{1}\)` is a t-distribution. So we can obtain confidence intervals with `$$b_{i} \pm t^{\ast}_{\text{df}} \times SE_{b_{i}}$$`

--

- All the numerical information you need to obtain confidence intervals for `\(\beta_{0}\)` and `\(\beta_{1}\)` is provided in the output of the `summary` command for a linear model fit with `lm`.

--

- Suppose we fit a linear model for the `possum` data again:

.panelset[
.panel[.panel-name[Results]

```
## 
## Call:
## lm(formula = head_l ~ total_l, data = possum)
## 
## Coefficients:
## (Intercept)      total_l  
##     42.7098       0.5729
```

]
.panel[.panel-name[Code]

```r
(lm_fit <- lm(head_l ~ total_l, data=possum))
```

]
]

--

- On the next slide, we print the output of `summary(lm_fit)`.

---

## lm Summary Output

.panelset[
.panel[.panel-name[Results]

```
## 
## Call:
## lm(formula = head_l ~ total_l, data = possum)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.1877 -1.5340 -0.3345  1.2788  7.3968 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 42.70979    5.17281   8.257 5.66e-13 ***
## total_l      0.57290    0.05933   9.657 4.68e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.595 on 102 degrees of freedom
## Multiple R-squared: 0.4776, Adjusted R-squared: 0.4725
## F-statistic: 93.26 on 1 and 102 DF, p-value: 4.681e-16
```

]
.panel[.panel-name[Code]

```r
summary(lm_fit)
```

]
]

--

- For obtaining a confidence interval, the relevant information is provided by `Estimate`, `Std. Error`, and the reported degrees of freedom.

---

## Regression CI Example

- Based on the information provided in the `summary` command output, we can construct confidence intervals for `\(\beta_{0}\)` and `\(\beta_{1}\)`.
First, we observe that there are 102 degrees of freedom, so the `\(t^{\ast}_{102}\)` value for a 95% CI is

.panelset[
.panel[.panel-name[Results]

```
## [1] 1.983495
```

]
.panel[.panel-name[Code]

```r
(t_ast <- -qt((1.0-0.95)/2,df=102))
```

]
]

--

- Now we can obtain our confidence intervals for `\(\beta_{0}\)` and `\(\beta_{1}\)`:

.panelset[
.panel[.panel-name[Results]

```
## [1] 32.4734 52.9466
```

```
## [1] 0.4512 0.6888
```

]
.panel[.panel-name[Code]

```r
(beta_0_CI <- 42.71 + 1.98*c(-1,1)*5.17)
(beta_1_CI <- 0.57 + 1.98*c(-1,1)*0.06)
```

]
]

---

## Hypothesis Testing for Linear Regression

- There are actually several types of hypothesis tests one can conduct relating to linear regression models.

--

- The most common test is of the form

- `\(H_{0}: \beta_{1} = 0\)`. The true linear model has slope zero. Versus

--

- `\(H_{A}: \beta_{1} \neq 0\)`. The true linear model has a slope different from zero.

--

- The `summary` command output includes a p-value for testing such a hypothesis. However, be aware that the `lm` command does not check whether the conditions for a linear model are met; the results for inference on the model parameters are only valid if those conditions hold.

---

## Hypothesis Test for possum Data

- Again, the output for `summary(lm_fit)` is

.panelset[
.panel[.panel-name[Results]

```
## 
## Call:
## lm(formula = head_l ~ total_l, data = possum)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.1877 -1.5340 -0.3345  1.2788  7.3968 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 42.70979    5.17281   8.257 5.66e-13 ***
## total_l      0.57290    0.05933   9.657 4.68e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.595 on 102 degrees of freedom
## Multiple R-squared: 0.4776, Adjusted R-squared: 0.4725
## F-statistic: 93.26 on 1 and 102 DF, p-value: 4.681e-16
```

]
.panel[.panel-name[Code]

```r
summary(lm_fit)
```

]
]

--

- The p-value corresponding to the slope estimate `\(b_{1}\)` is much smaller than 0.05, so we reject the null hypothesis `\(H_{0}: \beta_{1} = 0\)` at the `\(\alpha = 0.05\)` significance level.

---

## Inference for Regression Video

- Please watch this video to gain further perspective on inference for linear regression:
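
---

## Checking the CI with confint (Sketch)

- As a check on the hand computation above, base R's `confint` function returns confidence intervals for the coefficients of a fitted linear model directly. A minimal sketch, assuming the `lm_fit` object from the earlier slides:

```r
# lm_fit was obtained earlier with lm(head_l ~ total_l, data = possum)

# 95% confidence intervals for the intercept and the slope; these should
# agree with the b_i plus or minus t* times SE computation, up to rounding
confint(lm_fit, level = 0.95)
```
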

---

## More R For Linear Models

- To fit a linear model in R, we use the `lm` command. The necessary input is a formula of the form `y ~ x` and the data. The `summary` command outputs all of the information relevant for inferential purposes.

--

- However, the output from `summary` is not necessarily formatted in the most convenient way.

--

- Another approach to working with regression model output is provided by functions in the `broom` package. For example, the `tidy` function from `broom` displays the results of the model fit:

.panelset[
.panel[.panel-name[Results]

```
## # A tibble: 2 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   42.7      5.17        8.26 5.66e-13
## 2 total_l        0.573    0.0593      9.66 4.68e-16
```

]
.panel[.panel-name[Code]

```r
tidy(lm_fit)
```

]
]

--

- Let's go to R together, work some more examples, and learn to work with the `broom` functions.

---

## Worked Examples

- Let's work some more examples relating to linear regression.

---

## Comparing Many Means

- In previous lectures, we studied inference for a difference of means. There are also statistical methods for comparing more than two means. The primary method is called **analysis of variance** (ANOVA); see 7.5.

--

- ANOVA uses a single hypothesis to check whether the means across many groups are equal:

--

- `\(H_{0}: \mu_{1}=\mu_{2}=\cdots = \mu_{k}\)`. The mean outcome is the same across all groups. Versus

- `\(H_{A}:\)` At least one mean is different.

--

- We must check three conditions for ANOVA:

--

- (1) Observations are independent across groups. (2) The data within each group are nearly normal. (3) The variability across each group is about equal.

---

## Example Data Motivating ANOVA

- Let's consider our chicken feed data again. The first few rows are shown below:

.panelset[
.panel[.panel-name[Results]

```
##   weight      feed
## 1    179 horsebean
## 2    160 horsebean
## 3    136 horsebean
## 4    227 horsebean
## 5    217 horsebean
## 6    168 horsebean
```

]
.panel[.panel-name[Code]

```r
head(chickwts)
```

]
]

--

- The next slide shows a plot of this data.

---

## Plot of Data

.panelset[
.panel[.panel-name[Results]

<img src="index_files/figure-html/unnamed-chunk-28-1.png" style="display: block; margin: auto;" />

]
.panel[.panel-name[Code]

```r
chickwts %>% ggplot(aes(x=feed,y=weight)) + geom_boxplot()
```

]
]

--

- It appears that the mean weight is not the same for all groups of chickens. The question we would like to answer is: is the observed difference statistically significant?

---

## The F Statistic and the F-Test

- Analysis of variance (ANOVA) is used to test whether the mean outcome differs across two or more groups.

--

- ANOVA uses a test statistic denoted by `\(F\)`, which represents a standardized ratio of variability in the sample means relative to the variability within groups.

--

- ANOVA uses an `\(F\)` distribution to compute a p-value that corresponds to the probability of observing an `\(F\)` statistic value that is as or more extreme than the sample `\(F\)` statistic value under the assumption that the null hypothesis is true.

--

- We will see how to conduct ANOVA and an `\(F\)`-test using R.

--

- Before conducting ANOVA, we should discuss the necessary conditions for an ANOVA analysis.

---

## Conditions for ANOVA

- There are three conditions we must check for an ANOVA:

--

- Independence. If the data are a simple random sample, this condition is satisfied.

--

- Normality.
As with one- and two-sample testing for means, the normality assumption is especially important when the sample size is small. Grouped histograms are a good way to diagnose potential problems with the normality assumption for ANOVA.

--

- Constant variance. The variance in the groups should be close to equal. This assumption can be checked with side-by-side box plots.

---

## ANOVA Examples

- Let's see some examples of conducting ANOVA. We will do this together in R; a minimal sketch of the basic commands appears at the end of these slides.

---

## Next Time: Inference for Categorical Data

- Now that we have explored inference for numerical data in some depth, our next topic is methods of inference for categorical data. The included video is a good place to start:
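
---

## ANOVA in R (Sketch)

- As a preview of the ANOVA examples mentioned above, here is a minimal sketch of a one-way ANOVA and `\(F\)`-test for the `chickwts` data using base R's `aov` function; we discuss the output in detail when we work through the examples together.

```r
# One-way ANOVA: model chick weight by feed type (chickwts is a built-in data set)
chick_aov <- aov(weight ~ feed, data = chickwts)

# The ANOVA table reports the F statistic and p-value for
# H0: the mean weight is the same for every feed group
summary(chick_aov)
```
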

---

## Notes

---

## Notes

---

## Notes