MATH 204 Introduction to Statistics

class: center, middle, inverse, title-slide

.title[
# MATH 204 Introduction to Statistics
]
.subtitle[
## Lecture 4: Examining Numerical Data
]
.author[
### JMG
]

---

## Goals for Lecture

* Introduce graphical and numerical summaries for numerical data (textbook section 2.1).

* Define mean, median, variance, and standard deviation. 
    
--
    
* Introduce histograms and boxplots.
    
--

* Define and discuss key features of a distribution for a numerical variable.

---

## Numerical Data Video

On your own time, watch this video corresponding to textbook section 2.1. 
<div class="vembedr" align="center">
<div>
<iframe src="https://www.youtube.com/embed/Xm0PPtci3JE" width="711" height="400" frameborder="0" allowfullscreen="" data-external="1"></iframe>
</div>
</div>

---

## Recollections

Recall that in a previous lecture we introduced data:

- We discussed data collection and sampling strategies; and

- described the structure of data and variable types.

---

## The Research Workflow

.center[
<img src="https://www.dropbox.com/s/ycrjss9s0mn9o8e/3-s2.0-B9780128207888000109-f01-01-9780128207888.gif?raw=1" width="30%" />
]

- Research begins with a question

- Then, data is collected to answer the question

In this lecture, we describe the next step in the research process.

---

## Exploratory Data Analysis (EDA)

- Once we have collected data, the next step in the statistical process is to begin exploring the data.

- Typically, it is not helpful (or even possible) to look at an entire data set and gain meaningful insight. Instead, we work with summaries of our data, both graphical and numeric.

- Numeric summaries of data are often called [**descriptive statistics**](https://en.wikipedia.org/wiki/Descriptive_statistics).

- It is important to note that the type of a variable, *i.e.*, numerical or categorical, will determine the kind of graphical or numeric summary that is used.

---

## Exploring Numerical Variables

We begin our introduction to EDA by discussing summaries for numerical data. Specifically, we introduce

- Dot plots and the mean

- Histograms and shape

- Variance and standard deviation

- Box plots, the median, and robust statistics

- Scatterplots for paired numerical variables

These topics are discussed in section 2.1 of the textbook.

---

## Example Data

In this lesson we work with the `loan50` example data set, a sample of 50 loans taken from a larger data set that is described as follows:

> This data set represents thousands of loans made through the Lending Club platform, which is a platform that allows individuals to lend to other individuals.

---

## Data Glimpse

.panelset[
.panel[.panel-name[Data]

```
## Rows: 50
## Columns: 18
## $ state                   <fct> NJ, CA, SC, CA, OH, IN, NY, MO, FL, FL, MD, HI…
## $ emp_length              <dbl> 3, 10, NA, 0, 4, 6, 2, 10, 6, 3, 8, 10, 10, 2,…
## $ term                    <dbl> 60, 36, 36, 36, 60, 36, 36, 36, 60, 60, 36, 36…
## $ homeownership           <fct> rent, rent, mortgage, rent, mortgage, mortgage…
## $ annual_income           <dbl> 59000, 60000, 75000, 75000, 254000, 67000, 288…
## $ verified_income         <fct> Not Verified, Not Verified, Verified, Not Veri…
## $ debt_to_income          <dbl> 0.55752542, 1.30568333, 1.05628000, 0.57434667…
## $ total_credit_limit      <int> 95131, 51929, 301373, 59890, 422619, 349825, 1…
## $ total_credit_utilized   <int> 32894, 78341, 79221, 43076, 60490, 72162, 2872…
## $ num_cc_carrying_balance <int> 8, 2, 14, 10, 2, 4, 1, 3, 10, 4, 3, 4, 3, 2, 3…
## $ loan_purpose            <fct> debt_consolidation, credit_card, debt_consolid…
## $ loan_amount             <int> 22000, 6000, 25000, 6000, 25000, 6400, 3000, 1…
## $ grade                   <fct> B, B, E, B, B, B, D, A, A, C, D, A, A, A, A, E…
## $ interest_rate           <dbl> 10.90, 9.92, 26.30, 9.92, 9.43, 9.92, 17.09, 6…
## $ public_record_bankrupt  <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0…
## $ loan_status             <fct> Current, Current, Current, Current, Current, C…
## $ has_second_income       <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
## $ total_income            <dbl> 59000, 60000, 75000, 75000, 254000, 67000, 288…
```
]

.panel[.panel-name[Code]

```r
library(openintro)  # provides the loan50 data
library(dplyr)      # provides glimpse()
glimpse(loan50)
```
]
]

---

## A Single Numerical Variable

Consider just the `interest_rate` variable in the `loan50` data.

- We can see `interest_rate` by plotting the values of the variable on the number line:

There is overlap in the data, so darker dots correspond to more points laid on top of one another (*i.e.*, higher density).

---

## Stacked Dot Plots

We could also stack points that take on the same values in order to create a **dot plot**:

- **Question:** What information is provided by this plot?
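
As a rough sketch of how a stacked dot plot is built, the base R function `stripchart` can stack repeated values. The vector `values` below is made-up illustration data, not the loan data, and the plot styling is an arbitrary choice.

```r
# Made-up illustration data (an assumption, not the loan50 data)
values <- c(2, 5, 10, 4, 8, 2, 6, 3, 9, 10, 4, 6, 5, 5, 7)

# The height of each stack is the number of times a value occurs
stack_heights <- table(values)
stack_heights

# Draw the stacked dot plot: equal values are piled on top of one another
stripchart(values, method = "stack", pch = 19, offset = 0.5,
           xlab = "value")
```

The tallest stack sits over the most frequent value, so the plot shows both where the data are concentrated and how far they spread.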

---

## Descriptive Statistics

- Plots of data are very helpful in many ways. However, it is also nice to have simple ways to summarize and/or characterize the distribution of numerical data quantitatively.

- In fact, statistics provides us with methods for summarizing and characterizing numerical data. The most common ones are:

- 1) The sample **mean** and **median** which measure the *center* of a distribution of data, and

- 2) the sample **variance** and **standard deviation** which measure the *spread* of a distribution of data. The variance is roughly the average squared distance from the mean, and the standard deviation (its square root) describes how far a typical data value lies from the mean.

We will now give mathematical definitions for mean, median, variance, and standard deviation and show how to compute all of these quantities both by hand and using R.

---

## Definition of Mean and Examples

In words, we define the mean of sample data by

`$\text{sample mean (average) of data} = \frac{\text{sum of all sample data values}}{\text{sample size}}.$`

Mathematically, we define the mean of sample data by the formula

`$\bar{x} = \frac{x_{1} + x_{2} + \cdots + x_{n}}{n}.$`

Note that we read `$\bar{x}$` as "x bar".

---

## Example Computing Mean

Consider the following list of numbers:

```r
x <- c(2,5,10,4,8,2,6,3,9,10,4,6,5,5,7)
```

We can calculate the mean (or average) of these 15 values in a few different ways:

- By hand

.center[

`$\frac{2 + 5 + 10 + 4 + 8 + 2 + 6 + 3 + 9 + 10 + 4 + 6 + 5 + 5 + 7}{15} = \frac{86}{15} \approx 5.73$`

]

.pull-left[

- Manually in R

```r
sum(x)/length(x)
```

```
## [1] 5.733333
```

]

.pull-right[

- Using the `mean` function in R

```r
mean(x)
```

```
## [1] 5.733333
```

]

---

## Mean Example with Data

Let's compute the mean of the `interest_rate` variable:

.panelset[
.panel[.panel-name[Result]

```
## [1] 11.5672
```
]

.panel[.panel-name[Code]

```r
mean(loan50$interest_rate)
```
]
]

- **Question:** What is the value of `$n$` (*i.e.*, the sample size) for the `interest_rate` data?

- We can compute this `$n$` with

.panelset[
.panel[.panel-name[Result]

```
## [1] 50
```
]

.panel[.panel-name[Code]

```r
length(loan50$interest_rate)
```
]
]

---

## Thinking About the Mean

Let's see where the `interest_rate` mean falls with respect to the data for the variable:

**Question:** In what sense does the sample mean of the `interest_rate` data measure the center of the distribution of the sample data? How much do "extreme" values contribute to the mean?

---

## Definition of Median

If the data are ordered from smallest to largest, the sample **median** is the observation right in the middle.

- If there are an even number of observations, there will be two values in the middle, and the median is taken as their average value.

- The median is also known as the 50th percentile or 50th quantile.
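
The definition above can be sketched directly in R. The function name `median_by_hand` and the two small vectors are hypothetical, chosen only for illustration; in practice the built-in `median` function does the same job.

```r
# Hand-rolled median (hypothetical helper): sort the data, then take the
# middle value (odd n) or the average of the two middle values (even n).
median_by_hand <- function(v) {
  s <- sort(v)
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]
  } else {
    (s[n / 2] + s[n / 2 + 1]) / 2
  }
}

odd_values  <- c(7, 1, 4)      # made-up data with n = 3
even_values <- c(7, 1, 4, 10)  # made-up data with n = 4

median_by_hand(odd_values)   # middle of 1, 4, 7
median_by_hand(even_values)  # average of 4 and 7
```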

---

## Median Example

Consider again the list of numbers contained in the vector `x`:

```r
x
```

```
##  [1]  2  5 10  4  8  2  6  3  9 10  4  6  5  5  7
```

It is simple to order these from smallest to largest:

```r
sort(x)
```

```
##  [1]  2  2  3  4  4  5  5  5  6  6  7  8  9 10 10
```

With 15 values, the middle one is the 8th value in the sorted list, which is 5.

Let's confirm this using the R command `median`:

```r
median(x)
```

```
## [1] 5
```

---

## Median with Even Number of Values

To see what happens when we have an even number of sample values, consider the following list of values:

```r
y <- c(2,5,3,7,8,1,4,2,6,7)
```

Again, we can order them as

```r
sort(y)
```

```
##  [1] 1 2 2 3 4 5 6 7 7 8
```

Then the median should be

`$\frac{4+5}{2} = \frac{9}{2}=4.5$`

Let's confirm this with R:

```r
median(y)
```

```
## [1] 4.5
```

---

## Median Example with Data

The median of the `interest_rate` data is computed as

```r
median(loan50$interest_rate)
```

```
## [1] 9.93
```

---

## Thinking About the Median

The following plot shows the interest rate data plus both the sample mean and sample median.

- **Question:** In what sense does the sample median of the `interest_rate` data measure the center of the distribution of the sample data? How much do "extreme" values contribute to the median?

---

## Robust Statistics

Consider our data `x` again,

```r
x
```

```
##  [1]  2  5 10  4  8  2  6  3  9 10  4  6  5  5  7
```

```r
(mean(x))
```

```
## [1] 5.733333
```

```r
(median(x))
```

```
## [1] 5
```

Let's add an **outlier**

```r
(xl <- c(x,20))
```

```
##  [1]  2  5 10  4  8  2  6  3  9 10  4  6  5  5  7 20
```

---

## Mean and Median with Outliers

Observe what happens if we compute the mean and median with an outlier in the data:

```r
(xl <- c(x,20))
```

```
##  [1]  2  5 10  4  8  2  6  3  9 10  4  6  5  5  7 20
```

```r
(mean(xl))
```

```
## [1] 6.625
```

```r
(median(xl))
```

```
## [1] 5.5
```

- Notice that the median is less sensitive to the outlier than the mean is: the mean rises from about 5.73 to 6.63, while the median rises only from 5 to 5.5. Because of this, we call the median a **robust statistic**.

---

## IQR

The interquartile range (IQR) is another example of a robust statistic.

```r
(IQR(x))
```

```
## [1] 3.5
```

```r
(IQR(xl))
```

```
## [1] 4.25
```

- What exactly is the interquartile range? It will take a couple of slides to answer this question.

---

## Boxplots

.center[
<img src="https://www.dropbox.com/s/qpgw6veozn2z4jx/boxPlotLayoutNumVar.jpg?raw=1" width="55%" style="display: block; margin: auto;" />
]

---

## Definition of IQR

`$IQR = Q_{3} - Q_{1}$`

where

- `$Q_{1} =$` the 25th percentile

- `$Q_{3} =$` the 75th percentile
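
A minimal sketch of this definition in R, using the vector `xl` from the earlier slides. Note that `quantile` interpolates between data values by default, so a by-hand calculation may differ slightly. The 1.5 × IQR cutoff for flagging potential outliers is a common boxplot convention and is an assumption here, not something defined on this slide.

```r
x  <- c(2, 5, 10, 4, 8, 2, 6, 3, 9, 10, 4, 6, 5, 5, 7)
xl <- c(x, 20)  # x with the outlier 20 added

# Q1 and Q3 are the 25th and 75th percentiles
q  <- quantile(xl, probs = c(0.25, 0.75), names = FALSE)
q1 <- q[1]
q3 <- q[2]

# IQR = Q3 - Q1; this agrees with the built-in IQR()
iqr_xl <- q3 - q1
iqr_xl

# Assumed convention: points beyond 1.5 * IQR from the box are flagged
lower_fence <- q1 - 1.5 * iqr_xl
upper_fence <- q3 + 1.5 * iqr_xl
flagged <- xl[xl < lower_fence | xl > upper_fence]
flagged
```

Under that convention, only the added value 20 is flagged.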

---

## Boxplots with R

To create a boxplot for the `interest_rate` variable in the `loan50` data, we can do the following:

.panelset[
.panel[.panel-name[Boxplot]

<img src="index_files/figure-html/r-boxplot-1.png" style="display: block; margin: auto;" />
]

.panel[.panel-name[R Code]

```r
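library(openintro)  # provides the loan50 data
library(ggformula)  # provides gf_boxplot() and loads ggplot2 for coord_flip()
gf_boxplot(~interest_rate, data = loan50) + 
  coord_flip()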
```
]
]

---

## Some R Practice

Go to the R console and run the following commands:

- Compute the mean for the highway gas mileage `hwy_mpg` variable from the  `epa2021` data set.

```r
mean(epa2021$hwy_mpg)
```

- Compute the median for the highway gas mileage `hwy_mpg` variable from the  `epa2021` data set.

```r
median(epa2021$hwy_mpg)
```

---

## Definition of Variance and Standard Deviation

Mathematically, we define the sample **variance** (denoted by `$s^2$`) by the formula

`$s^2 = \frac{(x_1 - \bar{x})^2+(x_{2} - \bar{x})^2+\cdots +(x_{n}-\bar{x})^2}{n-1}.$`

Take care to note that the denominator is `$n-1$`, that is, the sample size minus 1.

The sample **standard deviation** (denoted by `$s$`) is simply the square root of the sample variance. That is,

`$s = \sqrt{\text{sample variance}} = \sqrt{\frac{(x_1 - \bar{x})^2+(x_{2} - \bar{x})^2+\cdots +(x_{n}-\bar{x})^2}{n-1}}.$`
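
The two formulas can be followed step by step in R. This is only a sketch for the small vector `x` used throughout the lecture; the intermediate names (`x_bar`, `squared_dev`, `s2`, `s`) are chosen here for illustration.

```r
x <- c(2, 5, 10, 4, 8, 2, 6, 3, 9, 10, 4, 6, 5, 5, 7)

n     <- length(x)    # sample size
x_bar <- sum(x) / n   # sample mean

# Squared distance of each value from the mean
squared_dev <- (x - x_bar)^2

# Sample variance: sum of squared deviations divided by n - 1
s2 <- sum(squared_dev) / (n - 1)

# Sample standard deviation: square root of the variance
s <- sqrt(s2)

c(variance = s2, std_dev = s)
```

The results should agree with the built-in `var(x)` and `sd(x)`.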

---

## Computing Variance and Standard Deviation

In R, we compute the sample variance using the command `var` and the sample standard deviation using the command `sd`. For example:

```r
(var(x))
```

```
## [1] 6.92381
```

```r
(sd(x))
```

```
## [1] 2.631313
```

Notice that

```r
sqrt(var(x))
```

```
## [1] 2.631313
```

gives the same value as `sd(x)`.

---

## Variance and Standard Deviation for Data

Let's compute the sample variance and standard deviation for the `interest_rate` data:

```r
(var(loan50$interest_rate))
```

```
## [1] 25.52387
```

```r
(sd(loan50$interest_rate))
```

```
## [1] 5.052115
```

---

## Thinking About Variance and Standard Deviation

Consider the following plot:

Notice that 34 of the 50 data points lie within one standard deviation of the mean, that is, in the interval `$(\bar{x}-s,\bar{x}+s)$`. That is 68 percent of the data.

---

## Two Standard Deviations

What percent of the data lie within two standard deviations of the mean, that is, within the interval `$(\bar{x}-2s,\bar{x}+2s)$`? Let's see:

Of 50 data points, 48 lie within two standard deviations of the mean. That is, 48 of the data values lie in the interval `$(\bar{x}-2s,\bar{x}+2s)$`. That is 96 percent of the data.

This is not a strict rule, but very often about 70% of the data lie within one standard deviation of the mean and about 95% lie within two standard deviations of the mean.
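
The counts of 34 and 48 can be checked with a short sketch. To keep it self-contained, the 50 interest rates are typed in from the sorted listing on the "Constructing a histogram" slide; with the `openintro` package loaded, `loan50$interest_rate` could be used instead. The helper name `within_k_sd` is hypothetical.

```r
# The 50 interest rates, transcribed from the sorted listing in this lecture
rates <- c(5.31, 5.31, 5.32, 6.08, 6.08, 6.08, 6.71, 6.71, 7.34, 7.35,
           7.35, 7.96, 7.96, 7.96, 7.97, 9.43, 9.43, 9.44, 9.44, 9.44,
           9.92, 9.92, 9.92, 9.92, 9.93, 9.93, 10.42, 10.42, 10.90, 10.90,
           10.91, 10.91, 10.91, 11.98, 12.62, 12.62, 12.62, 14.08, 15.04,
           16.02, 17.09, 17.09, 17.09, 18.06, 18.45, 19.42, 20.00, 21.45,
           24.85, 26.30)

x_bar <- mean(rates)
s     <- sd(rates)

# Count the values strictly inside (x_bar - k*s, x_bar + k*s)
within_k_sd <- function(k) sum(abs(rates - x_bar) < k * s)

within_k_sd(1)                        # count within one standard deviation
within_k_sd(2)                        # count within two standard deviations
100 * within_k_sd(1) / length(rates)  # as a percentage
100 * within_k_sd(2) / length(rates)
```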

---

## Histograms and Shape

- Dot plots show the exact value for each observation in a sample.

- A histogram "bins" the sample data into distinct intervals and counts the frequency of data points occurring within each bin interval.

---

## Histogram Example with Data

The following plot shows a histogram for the `interest_rate` variable in the `loan50` data set:

.panelset[
.panel[.panel-name[Histogram]

<img src="index_files/figure-html/eg-hist-1.png" style="display: block; margin: auto;" />
]

.panel[.panel-name[R Code]

```r
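library(openintro)  # provides the loan50 data
library(ggformula)  # provides gf_histogram() and loads ggplot2
gf_histogram(~interest_rate, data = loan50, boundary = 5, binwidth = 2.5, color = "black") +
  scale_x_continuous(breaks = c(5, 10, 15, 20, 25))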
```
]
]

- Let's examine how this histogram is constructed.

---

## Constructing a histogram

Look at the `interest_rate` variable data after sorting it:

```
##  [1]  5.31  5.31  5.32  6.08  6.08  6.08  6.71  6.71  7.34  7.35  7.35  7.96
## [13]  7.96  7.96  7.97  9.43  9.43  9.44  9.44  9.44  9.92  9.92  9.92  9.92
## [25]  9.93  9.93 10.42 10.42 10.90 10.90 10.91 10.91 10.91 11.98 12.62 12.62
## [37] 12.62 14.08 15.04 16.02 17.09 17.09 17.09 18.06 18.45 19.42 20.00 21.45
## [49] 24.85 26.30
```

- If our bins are the intervals `$(5,7.5], (7.5,10.0], \ldots$`, then we see that there are 11 data points in bin 1, 15 data points in bin 2, etc. A value that falls exactly on a bin boundary (here, only 20.00) is counted in the lower bin.
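
The bin counts can be reproduced with base R's `cut` and `table`. This is a sketch that assumes bins of width 2.5 starting at 5, with each bin including its right endpoint (`right = TRUE`), which is assumed here to match the default behavior of the plotting code. The data are transcribed from the sorted listing above.

```r
# The 50 sorted interest rates, transcribed from the listing on this slide
rates <- c(5.31, 5.31, 5.32, 6.08, 6.08, 6.08, 6.71, 6.71, 7.34, 7.35,
           7.35, 7.96, 7.96, 7.96, 7.97, 9.43, 9.43, 9.44, 9.44, 9.44,
           9.92, 9.92, 9.92, 9.92, 9.93, 9.93, 10.42, 10.42, 10.90, 10.90,
           10.91, 10.91, 10.91, 11.98, 12.62, 12.62, 12.62, 14.08, 15.04,
           16.02, 17.09, 17.09, 17.09, 18.06, 18.45, 19.42, 20.00, 21.45,
           24.85, 26.30)

# Bin edges: 5, 7.5, 10, ..., 27.5
breaks <- seq(5, 27.5, by = 2.5)

# Assign each rate to a bin of the form (lower, upper]
bins <- cut(rates, breaks = breaks, right = TRUE)

# Frequency of data points in each bin; these are the bar heights
bin_counts <- table(bins)
bin_counts
```

The first two counts are 11 and 15, matching bins 1 and 2 above.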

---

## The Use of Histograms

- Histograms provide a visualization of the density of sample data.

- That is, higher bars represent higher frequency of a range of values.

- Histograms also indicate the shape of the distribution of data.

---

## Shape of a Distribution

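A histogram can suggest whether our data are **skewed**, that is, whether the distribution trails off further on one side than the other. A long tail to the right is called right skew, and a long tail to the left is called left skew.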

---

## Multimodal Distributions

A distribution with one peak is called unimodal, a distribution with two peaks is called bimodal, and a distribution with more than two peaks is called multimodal.

---

## Summary

In this lecture, we covered the topics of

- Descriptive statistics: mean, median, variance, and standard deviation; and

- we introduced dot plots, histograms, and boxplots for visualizing numerical data.

- R commands for computing mean, median, variance, and standard deviation were introduced; and

- we saw one way to obtain a boxplot or a histogram using R.

- We also introduced the concept of outliers and robust statistics.

---

## Next Time

Before next time, watch this video corresponding to textbook section 2.2. 
<div class="vembedr" align="center">
<div>
<iframe src="https://www.youtube.com/embed/7NhNeADL8fA" width="711" height="400" frameborder="0" allowfullscreen="" data-external="1"></iframe>
</div>
</div>

---

## Notes

---

## Notes

---

## Notes