MATH 204 Introduction to Statistics

class: center, middle, inverse, title-slide

.title[
# MATH 204 Introduction to Statistics
]
.subtitle[
## Lecture 10: Normal Distributions
]
.author[
### JMG
]

---

## Goals for Lecture

In this lecture, we will

- Study the **normal distribution** and learn how to compute probability values for a random variable that follows a normal distribution.

- The normal distribution is a continuous distribution. We will discuss the so-called normal curve which is the probability density function for a normal distribution.

- We will also learn about the expected value and variance for a normal random variable.

- This lecture corresponds to section 4.1 in the textbook, and you are encouraged to watch the lecture video included in the next slide.

---

## Normal Distribution Video

---

## Samples from a Normal Distribution

- The following histogram shows 5,000 random samples from a normal random variable:

.panelset[
.panel[.panel-name[Normal Samples]
<img src="index_files/figure-html/norm_samps-1.png" style="display: block; margin: auto;" />

]

.panel[.panel-name[Code]

```r
tibble(x=rnorm(5000,66,3.5)) %>% 
  gf_histogram(~x,color="black") +
  theme_minimal(base_size = 22) 
```
]
]

---

## Normal Histogram Discussion

- Think of the histogram on the previous slide as showing sample data for measurements of human heights in inches.

- What are the key features of the histogram?

- It is unimodal, highly symmetric, and centered at the mean.

- Do you think that such data could reasonably correspond to measurements of human heights in inches?

- Do you think it is reasonable to treat measurements of human heights in inches as a continuous variable?

---

## Normal Distribution Facts

> Many random variables are nearly normal, but none are exactly normal. Thus the normal distribution, while not perfect for any single problem, is very useful for a variety of problems. We will use it in data exploration and to solve important problems in statistics. 
--

- Let's spend some time to develop some intuition for how the normal distribution is often used in practice.

- We start by generating some data that is not necessarily normally distributed.

---

## Sampling a Mean

- We are going to play a game. We proceed as follows:

1) Sample 15 values from a binomial random variable with `$n=25$` and `$p=0.5$`. You can think of this as doing 15 rounds of an experiment where each time we flip a fair coin 25 times and count the number of heads that we obtain.
  
--

2) We compute and record the mean of the 15 values obtained in step 1.

3) We repeat steps 1 & 2 a very large number of times, say 2,500. 
  
--

- Here is the first few rows of a data frame that contains the data we acquire by playing our game.

```
## # A tibble: 6 × 1
##   means
##   <dbl>
## 1  12.5
## 2  12.4
## 3  12  
## 4  13.4
## 5  12.5
## 6  12
```

---

## Plotting Our Data

- Here is a histogram of the data we obtained from the game described on the previous slide:

.panelset[
.panel[.panel-name[Sample Means]
<img src="index_files/figure-html/samp_mean_hist-1.png" style="display: block; margin: auto;" />

]

.panel[.panel-name[Code]

```r
means_df %>% gf_histogram(~means,color="black") +
  theme_minimal(base_size = 22)
```
]
]

- The mean of our data is 12.469. What is the expected value of a binomial random variable with `$n=25$` and `$p=0.5$`?

---

## A Normal Density Function

- The following plot shows a **normal** density curve:

.panelset[
.panel[.panel-name[Normal Density]
<img src="index_files/figure-html/norm_dens,-1.png" style="display: block; margin: auto;" />

]

.panel[.panel-name[Code]

```r
gf_dist("norm") +
  theme_minimal(base_size = 22)
```
]
]

---

## Center and Shape of a Normal Curve

- There are two **parameters** called `$\mu$` (mean) and `$\sigma$` (sd) that determine the center and shape, respectively of a normal curve. For example,:

---

## Going Back to Data

- Recall that to obtain our data, we drew 15 **independent** samples from a binomial distribution with `$n=25$` and `$p=0.5$`, took the sample mean, and repeated this many times. The following plot shows the histogram of our data (with density instead of count on the `$y$`-axis) and overlays a particular normal density curve.

---

## Sampling Distribution of the Mean

- We can think of the sample mean as a random variable. The sample mean inputs 
sample data of a fixed size from a population and returns the mean of the data.

- The point is that the sample mean will vary as the sample varies.

- What the last slide shows us is that, at least in the particular example, the distribution of the sample mean (viewed as a random variable) is very close to a random variable that is normally distributed.

- In general, we call the distribution of the sample mean (viewed as a random variable), the **sampling distribution of the mean**.  It is a general fact that the sampling distribution of the mean (for independent samples) is always very close to a normal distribution, regardless of the type of distribution used to sample the data for which the sample mean is computed.

- This is the reason why the normal distribution plays such a central role in statistics.

---

## Notation for Normal Random Variables

- If `$X$` is a normal random variable with expected value `$\mu=E(X)$` and standard deviation `$\sigma = \sqrt{\text{Var}(X)}$`, then we write `$X \sim N(\mu,\sigma)$`.

- For example, if we have a normal random variable `$X$` with expected value `$12.5$` and standard deviation `$0.5$`, then we write `$X \sim N(\mu=12.5,\sigma=0.5)$`.

- We call a random variable `$Z$` that satisfies `$Z \sim N(\mu=0,\sigma=1)$` a **standard normal variable** and we call the normal distribution with `$\mu=0$` and `$\sigma=1$` the **standard normal distribution**.

- Our next goal is to see how to use the normal density function to compute probability values for a random variable that follows a normal distribution.

---

## Standardizing with Z-scores

- Let `$X \sim N(\mu,\sigma)$` be a normal random variable.

- Define a new random variable `$Z = \frac{X - \mu}{\sigma}$`. We can compute the expectation and variance of `$Z$` as follows:

`$$E\left(Z\right)=E\left(\frac{X - \mu}{\sigma}\right)=\frac{1}{\sigma}E\left(X - \mu\right) = \frac{1}{\sigma}(E(X) - \mu) = \frac{1}{\sigma}(\mu - \mu) = 0$$`

`$$\text{Var}\left(Z\right)=\text{Var}\left(\frac{X - \mu}{\sigma}\right)=\frac{1}{\sigma^2}\text{Var}\left(X - \mu\right)=\frac{1}{\sigma^2}\text{Var}(X) = \frac{\sigma^2}{\sigma^2}=1$$`

- So, if `$X\sim N(\mu,\sigma)$`, then `$Z\sim N(0,1)$`, where `$Z = \frac{X-\mu}{\sigma}$`.

- This is all a little abstract, so let's think about standardizing from a different perspective.

---

## Standardizing Data

- The following table shows the mean and standard deviation for total scores on the SAT and ACT exams:

- Suppose we know that one person scored 1300 on the SAT and another person scored 24 on the ACT. Who had the better exam score? The numbers alone can not be compared. But their standardized `$Z$`-scores can.

`$$\text{person 1 Z-score}=\frac{1300-1100}{200}=1,\ \ \text{person 2 Z-score}=\frac{24-21}{6}=\frac{1}{2}$$`

- So we conclude that person 1 had the better exam score.

- Note that observations above the mean have a positive `$Z$`-score while observations below the mean have a negative `$Z$`-score.

---

## Probability for Normal R.V.'s

- The first thing to know is that the total area under a normal density function is equal to 1. This is true regardless of the values for `$\mu$` and `$\sigma$`. Thus, the shaded area shown below is 1:

---

## Tail Areas

- Suppose that `$X \sim N(\mu,\sigma)$`, how do we find an area such as one of the ones shown below. That is, how do we find the area under one of the tails of a normal density function?

---

## Interpreting Tail Areas

- In the last slide, the shaded area on the left represents the probability of an outcome being less than or equal to -1.5.

- In the last slide, the shaded area on the right represents the probability of an outcome being greater than or equal to 1.8.

- We will use `$Z$`-scores and the standard normal distribution to compute such areas.

- We proceed by first learning to compute tail areas under the standard normal density curve which corresponds to a random variable `$Z \sim N(\mu=0,\sigma=1)$`.

---

## Standard Normal Tail Areas

- We spell out the steps for computing tail areas under the standard normal density function.

- First, draw a picture of the tail area you want to compute. 
  
--
  
  - Decide if it's a left tail area or a right tail area. 
  
--

- If it's a left tail area, use `pnorm(z)` in R to compute the value.
  
--

- If it's a right tail area, use `1 - pnorm(z)` in R to compute the value. 
 
--

- The next slide explains why this approach works.

---

## The Meaning of `pnorm`

- The R command `pnorm(z)` computes the area under the standard normal density function for values less than or equal to `$z$`. That is, it computes the tail area to the left of the value `z`.

---

## Right Tail Areas

- This figure explains why we use `1-pnorm(z)` to compute right tail areas.

- Since the total area is 1 and `pnorm(z)` gives the left tail area it must be that `1-pnorm(z)` gives the right tail area.

---

## Left Tail Example

- What is the area under the standard normal density curve to the left of -2? That is, what is the area shown in the figure?

- It is `pnorm(-2) = ` 0.0227501.

---

## Right Tail Example

- What is the area under the standard normal density curve to the right of 1? That is, what is the area shown in the figure?

- It is `1-pnorm(1) = ` 0.1586553.

---

## Middle Areas

- You are probably curious as to how we can compute a "middle" area such as the one shown below.

- The answer is simple, find the left **and** right tail areas and then subtract from 1.

---

## Middle Area Example

- To compute the area shown:

.panelset[
.panel[.panel-name[Mid Area]

```
## [1] 0.5763928
```

]

.panel[.panel-name[Code]

```r
l_area <- pnorm(-0.5) # compute left tail area
r_area <- 1 - pnorm(1.2) # copute right tail area
(mid_area <- 1 - (l_area + r_area)) # subtract tails areas from 1
```
]
]

---

## Alternative Method for Mid Areas

- Notice that the following computation also finds the mid area shown on the previous slide

.panelset[
.panel[.panel-name[Mid Area]

```
## [1] 0.5763928
```

]

.panel[.panel-name[Code]

```r
pnorm(1.2) - pnorm(-0.5)
```
]
]

- Take a minute to think about why this works and then we will discuss together.

- What does `pnorm(1.2)` represent? Draw the area under the standard normal density curve that it represents.

- What does `pnorm(-0.5)` represent? Draw the area under the standard normal density curve that it represents.

- Now what does the difference `pnorm(1.2) - pnorm(-0.5)` represent?

---

## Areas for General Normal Density Curves

- Suppose that `$X \sim N(\mu = 10, \sigma=2.5)$`. How can we find the left tail area the density curve corresponding to `$N(\mu = 10, \sigma=2.5)$` that lies to the left of `$x=7.3$`? This area is shown below.

- The answer is, we first standardize to find `$z=\frac{x-\mu}{\sigma}=\frac{7.3-10}{2.5}=$` -1.08, then we compute `pnorm(z) = ` 0.1400711.

---

## General Steps for Normal Curve Areas

- To compute the area under a normal density curve for `$N(\mu,\sigma)$`,

- First draw a picture and determine if it is a left tail area, right tail area, or middle area. 
  
--

- Then standardize by subtracting the mean `$\mu$` and dividing by the standard deviation `$\sigma$`. 
  
--

- Finally, use `pnorm` in R as we have explained over the last several slides. 
  
--

- We take a break from the slides to work out some examples together.

---

## The 68-95-99.7 Rule

- The following figure shows the percentage of area under a normal curve that lies within 1, 2, and 3 standard deviations of the mean, respectively.

.center[
<img src="https://www.dropbox.com/s/2e73g42uinqq5gz/6895997.png?raw=1" width="75%" />
]

---

## Worked Examples

- Let's take a break from lecture slides to go through some worked examples using the normal distribution.

---

## Foundations for Inference

- We have now developed the probability tools we need to begin to discuss statistical inference. Before the next lecture, please watch this video:

---

## Notes

---

## Notes

---

## Notes