class: center, middle, inverse, title-slide

.title[
# MATH 204 Introduction to Statistics
]
.subtitle[
## Lecture 4: Examining Numerical Data
]
.author[
### JMG
]

---

## Goals for Lecture

* Introduce graphical and numerical summaries for numerical data. Textbook section 2.1

--

* Define mean, median, variance, and standard deviation.

--

* Introduce histograms and boxplots.

--

* Define and discuss key features of a distribution for a numerical variable.

---

## Numerical Data Video

On your own time, watch this video corresponding to textbook section 2.1.
---

## Recollections

Recall that in a previous lecture we introduced data.

--

- We discussed data collection and sampling strategies; and

- described the structure of data and variable types.

---

## The Research Workflow

.center[
<img src="https://www.dropbox.com/s/ycrjss9s0mn9o8e/3-s2.0-B9780128207888000109-f01-01-9780128207888.gif?raw=1" width="30%" />
]

- Research begins with a question

- Then, data is collected to answer the question

--

In this lecture, we describe the next step in the research process.

---

## Exploratory Data Analysis (EDA)

- Once we have collected data, the next step in the statistical process is to begin exploring the data.

--

- Typically, it is not helpful (or even possible) to look at an entire data set and gain meaningful insight. Instead, we work with summaries of our data, both graphical summaries and numeric summaries.

--

- Numeric summaries of data are often called [**descriptive statistics**](https://en.wikipedia.org/wiki/Descriptive_statistics).

--

- It is important to note that the type of a variable, *i.e.*, numerical or categorical, will determine the kind of graphical or numeric summary that is used.

---

## Exploring Numerical Variables

We begin our introduction to EDA by discussing summaries for numerical data. Specifically, we introduce

--

- Dot plots and the mean

--

- Histograms and shape

--

- Variance and standard deviation

--

- Box plots, the median, and robust statistics

--

- Scatterplots for paired numerical variables

These topics are discussed in section 2.1 of the textbook.

---

## Example Data

In this lesson we work with the `loan50` example data set.

> This data set represents thousands of loans made through the Lending Club platform, which is a platform that allows individuals to lend to other individuals.
---

## Data Glimpse

.panelset[
.panel[.panel-name[Data]

```
## Rows: 50
## Columns: 18
## $ state <fct> NJ, CA, SC, CA, OH, IN, NY, MO, FL, FL, MD, HI…
## $ emp_length <dbl> 3, 10, NA, 0, 4, 6, 2, 10, 6, 3, 8, 10, 10, 2,…
## $ term <dbl> 60, 36, 36, 36, 60, 36, 36, 36, 60, 60, 36, 36…
## $ homeownership <fct> rent, rent, mortgage, rent, mortgage, mortgage…
## $ annual_income <dbl> 59000, 60000, 75000, 75000, 254000, 67000, 288…
## $ verified_income <fct> Not Verified, Not Verified, Verified, Not Veri…
## $ debt_to_income <dbl> 0.55752542, 1.30568333, 1.05628000, 0.57434667…
## $ total_credit_limit <int> 95131, 51929, 301373, 59890, 422619, 349825, 1…
## $ total_credit_utilized <int> 32894, 78341, 79221, 43076, 60490, 72162, 2872…
## $ num_cc_carrying_balance <int> 8, 2, 14, 10, 2, 4, 1, 3, 10, 4, 3, 4, 3, 2, 3…
## $ loan_purpose <fct> debt_consolidation, credit_card, debt_consolid…
## $ loan_amount <int> 22000, 6000, 25000, 6000, 25000, 6400, 3000, 1…
## $ grade <fct> B, B, E, B, B, B, D, A, A, C, D, A, A, A, A, E…
## $ interest_rate <dbl> 10.90, 9.92, 26.30, 9.92, 9.43, 9.92, 17.09, 6…
## $ public_record_bankrupt <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0…
## $ loan_status <fct> Current, Current, Current, Current, Current, C…
## $ has_second_income <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
## $ total_income <dbl> 59000, 60000, 75000, 75000, 254000, 67000, 288…
```
]

.panel[.panel-name[Code]

```r
library(openintro)  # provides the loan50 data
library(dplyr)      # provides glimpse()
glimpse(loan50)
```
]
]

---

## A Single Numerical Variable

Consider just the `interest_rate` variable in the `loan50` data.

--

- We can see `interest_rate` by plotting the values of the variable on the number line:

<img src="index_files/figure-html/oneD_scatter-1.png" style="display: block; margin: auto;" />

There is overlap in the data, so darker dots correspond to more points laid on top of one another (*i.e.*, higher density).
---

## Stacked Dot Plots

We could also stack points that take on the same values in order to create a **dot plot**:

<img src="index_files/figure-html/eg_dot-1.png" style="display: block; margin: auto;" />

--

- **Question:** What information is provided by this plot?

---

## Descriptive Statistics

- Plots of data are very helpful in many ways. However, it is also nice to have simple ways to summarize and/or characterize the distribution of numerical data quantitatively.

--

- In fact, statistics provides us with methods for summarizing and characterizing numerical data. The most common ones are:

--

- 1) The sample **mean** and **median** which measure the *center* of a distribution of data, and

--

- 2) the sample **variance** and **standard deviation** which measure the *spread* of a distribution of data. The variance is roughly the average squared distance from the mean, and the standard deviation, its square root, describes how far the data typically lie from the mean.

--

We will now give mathematical definitions for mean, median, variance, and standard deviation and show how to compute all of these quantities both by hand and using R.

---

## Definition of Mean and Examples

In words, we define the mean of sample data by

`\(\text{sample mean (average) of data} = \frac{\text{sum of all sample data values}}{\text{sample size}}.\)`

--

Mathematically, we define the mean of sample data by the formula

`\(\bar{x} = \frac{x_{1} + x_{2} + \cdots + x_{n}}{n}.\)`

--

Note that we read `\(\bar{x}\)` as "x bar".
---

## Example Computing Mean

Consider the following list of numbers:

```r
x <- c(2,5,10,4,8,2,6,3,9,10,4,6,5,5,7)
```

We can calculate the mean (or average) of these 15 values in a few different ways:

--

- By hand

.center[
`\(\frac{2 + 5 + 10 + 4 + 8 + 2 + 6 + 3 + 9 + 10 + 4 + 6 + 5 + 5 + 7}{15} = \frac{86}{15} \approx 5.73\)`
]

--

.pull-left[

- Manually in R

```r
sum(x)/length(x)
```

```
## [1] 5.733333
```
]

.pull-right[

- Using the `mean` function in R

```r
mean(x)
```

```
## [1] 5.733333
```
]

---

## Mean Example with Data

Let's compute the mean of the `interest_rate` variable:

.panelset[
.panel[.panel-name[Result]

```
## [1] 11.5672
```
]

.panel[.panel-name[Code]

```r
mean(loan50$interest_rate)
```
]
]

--

- **Question:** What is the value of `\(n\)` (*i.e.*, the sample size) for the `interest_rate` data?

--

- We can compute this `\(n\)` with

.panelset[
.panel[.panel-name[Result]

```
## [1] 50
```
]

.panel[.panel-name[Code]

```r
length(loan50$interest_rate)
```
]
]

---

## Thinking About the Mean

Let's see where the `interest_rate` mean falls with respect to the data for the variable:

<img src="index_files/figure-html/mean_plotted-1.png" style="display: block; margin: auto;" />

**Question:** In what sense does the sample mean of the `interest_rate` data measure the center of the distribution of the sample data? How much do "extreme" values contribute to the mean?

---

## Definition of Median

If the data are ordered from smallest to largest, the sample **median** is the observation right in the middle.

--

- If there are an even number of observations, there will be two values in the middle, and the median is taken as their average value.

--

- The median is also known as the 50th percentile or 50th quantile.
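--

The definition above can be turned into a few lines of R. This is only a sketch: `median_by_hand` is a hypothetical helper written for illustration (it is not a base R function), and the two vectors are made-up values.

```r
# Hypothetical helper: compute the median straight from the definition
median_by_hand <- function(v) {
  s <- sort(v)      # order the values from smallest to largest
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]                  # odd n: the single middle value
  } else {
    mean(s[c(n / 2, n / 2 + 1)])    # even n: average of the two middle values
  }
}

odd_vals  <- c(7, 1, 4, 9, 3)       # made-up values, odd sample size
even_vals <- c(7, 1, 4, 9, 3, 6)    # made-up values, even sample size

median_by_hand(odd_vals)
median_by_hand(even_vals)
quantile(even_vals, 0.5)            # the 50th percentile, for comparison
```

In practice the built-in `median` function does the same job; the helper is only meant to make the odd and even cases of the definition visible.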
---

## Median Example

Consider again the list of numbers contained in the vector `x`:

```r
x
```

```
## [1] 2 5 10 4 8 2 6 3 9 10 4 6 5 5 7
```

--

It is simple to order these from smallest to largest:

```r
sort(x)
```

```
## [1] 2 2 3 4 4 5 5 5 6 6 7 8 9 10 10
```

The value in the middle is obviously 5.

--

Let's confirm this using the R command `median`:

```r
median(x)
```

```
## [1] 5
```

---

## Median with Even Number of Values

To see what happens when we have an even number of sample values, consider the following list of values:

```r
y <- c(2,5,3,7,8,1,4,2,6,7)
```

--

Again, we can order them as

```r
sort(y)
```

```
## [1] 1 2 2 3 4 5 6 7 7 8
```

--

Then the median should be `\(\frac{4+5}{2} = \frac{9}{2}=4.5\)`

--

Let's confirm this with R:

```r
median(y)
```

```
## [1] 4.5
```

---

## Median Example with Data

The median of the `interest_rate` data is computed as

```r
median(loan50$interest_rate)
```

```
## [1] 9.93
```

---

## Thinking About the Median

The following plot shows the interest rate data plus both the sample mean and sample median.

<img src="index_files/figure-html/plot_median-1.png" style="display: block; margin: auto;" />

--

- **Question:** In what sense does the sample median of the `interest_rate` data measure the center of the distribution of the sample data? How much do "extreme" values contribute to the median?
---

## Robust Statistics

Consider our data `x` again:

```r
x
```

```
## [1] 2 5 10 4 8 2 6 3 9 10 4 6 5 5 7
```

```r
(mean(x))
```

```
## [1] 5.733333
```

```r
(median(x))
```

```
## [1] 5
```

--

Let's add an **outlier**

```r
(xl <- c(x,20))
```

```
## [1] 2 5 10 4 8 2 6 3 9 10 4 6 5 5 7 20
```

---

## Mean and Median with Outliers

Observe what happens if we compute the mean and median with an outlier in the data:

```r
(xl <- c(x,20))
```

```
## [1] 2 5 10 4 8 2 6 3 9 10 4 6 5 5 7 20
```

```r
(mean(xl))
```

```
## [1] 6.625
```

```r
(median(xl))
```

```
## [1] 5.5
```

--

- Notice that the median is less sensitive to the outlier than the mean is. Because of this, we call the median a **robust statistic**.

---

## IQR

The interquartile range (IQR) is another example of a robust statistic.

```r
(IQR(x))
```

```
## [1] 3.5
```

```r
(IQR(xl))
```

```
## [1] 4.25
```

--

- What exactly is the interquartile range? It will take a couple of slides to answer this question.

---

## Boxplots

.center[
<img src="https://www.dropbox.com/s/qpgw6veozn2z4jx/boxPlotLayoutNumVar.jpg?raw=1" width="55%" style="display: block; margin: auto;" />
]

---

## Definition of IQR

`\(IQR = Q_{3} - Q_{1}\)`

where

- `\(Q_{1} =\)` the 25th percentile

- `\(Q_{3} =\)` the 75th percentile

---

## Boxplots with R

To create a boxplot for the `interest_rate` variable in the `loan50` data, we would do as follows:

.panelset[
.panel[.panel-name[Boxplot]

<img src="index_files/figure-html/r-boxplot-1.png" style="display: block; margin: auto;" />
]

.panel[.panel-name[R Code]

```r
gf_boxplot(~interest_rate,data=loan50) + coord_flip()
```
]
]

---

## Some R Practice

Go to the R console and run the following commands:

--

- Compute the mean for the highway gas mileage `hwy_mpg` variable from the `epa2021` data set.

```r
mean(epa2021$hwy_mpg)
```

--

- Compute the median for the highway gas mileage `hwy_mpg` variable from the `epa2021` data set.
```r
median(epa2021$hwy_mpg)
```

---

## Definition of Variance and Standard Deviation

Mathematically, we define the sample **variance** (denoted by `\(s^2\)`) by the formula

`\(s^2 = \frac{(x_1 - \bar{x})^2+(x_{2} - \bar{x})^2+\cdots +(x_{n}-\bar{x})^2}{n-1}.\)`

Take care to note that the denominator is `\(n-1\)`, that is, the sample size minus 1.

--

The sample **standard deviation** (denoted by `\(s\)`) is simply the square root of the sample variance. That is,

`\(s = \sqrt{\text{sample variance}} = \sqrt{\frac{(x_1 - \bar{x})^2+(x_{2} - \bar{x})^2+\cdots +(x_{n}-\bar{x})^2}{n-1}}.\)`

---

## Computing Variance and Standard Deviation

In R, we compute the sample variance using the command `var` and the sample standard deviation using the command `sd`. For example:

```r
(var(x))
```

```
## [1] 6.92381
```

```r
(sd(x))
```

```
## [1] 2.631313
```

--

Notice that

```r
sqrt(var(x))
```

```
## [1] 2.631313
```

gives the same as `sd(x)`.

---

## Variance and Standard Deviation for Data

Let's compute the sample variance and standard deviation for the `interest_rate` data:

```r
(var(loan50$interest_rate))
```

```
## [1] 25.52387
```

```r
(sd(loan50$interest_rate))
```

```
## [1] 5.052115
```

---

## Thinking About Variance and Standard Deviation

Consider the following plot:

<img src="index_files/figure-html/sd_plot-1.png" style="display: block; margin: auto;" />

Now notice that of 50 data points, 34 of them lie within one standard deviation of the mean. That is, 34 of the data values lie in the interval `\((\bar{x}-s,\bar{x}+s)\)`. That is 68 percent of the data.

---

## Two Standard Deviations

What percent of the data lie within two standard deviations of the mean, that is, within the interval `\((\bar{x}-2s,\bar{x}+2s)\)`? Let's see:

<img src="index_files/figure-html/two_sd-1.png" style="display: block; margin: auto;" />

Of 50 data points, 48 lie within two standard deviations of the mean. That is, 48 of the data values lie in the interval `\((\bar{x}-2s,\bar{x}+2s)\)`.
That is 96 percent of the data.

Not always, but very often, about 70% of the data lie within one standard deviation of the mean and about 95% lie within two standard deviations of the mean.

---

## Histograms and Shape

- Dot plots show the exact value for each observation in a sample.

--

- A histogram "bins" the sample data into distinct intervals and counts the frequency of data points occurring within each bin interval.

---

## Histogram Example with Data

The following plot shows a histogram for the `interest_rate` variable in the `loan50` data set:

.panelset[
.panel[.panel-name[Histogram]

<img src="index_files/figure-html/eg-hist-1.png" style="display: block; margin: auto;" />
]

.panel[.panel-name[R Code]

```r
gf_histogram(~interest_rate,data=loan50,boundary=5,binwidth=2.5,color="black") +
  scale_x_continuous(breaks = c(5,10,15,20,25))
```
]
]

--

- Let's examine how this histogram is constructed.

---

## Constructing a Histogram

Look at the `interest_rate` variable data after sorting it:

```
## [1] 5.31 5.31 5.32 6.08 6.08 6.08 6.71 6.71 7.34 7.35 7.35 7.96
## [13] 7.96 7.96 7.97 9.43 9.43 9.44 9.44 9.44 9.92 9.92 9.92 9.92
## [25] 9.93 9.93 10.42 10.42 10.90 10.90 10.91 10.91 10.91 11.98 12.62 12.62
## [37] 12.62 14.08 15.04 16.02 17.09 17.09 17.09 18.06 18.45 19.42 20.00 21.45
## [49] 24.85 26.30
```

--

- If our bins are the intervals `\([5,7.5], (7.5,10.0], \ldots\)`, then we see that there are 11 data points in bin 1, 15 data points in bin 2, etc.

---

## The Use of Histograms

- Histograms provide a visualization of the density of sample data.

--

- That is, higher bars represent higher frequency of a range of values.

--

- Histograms also indicate the shape of the distribution of data.
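--

To make the binning step concrete, here is a minimal sketch that counts values per bin with base R's `cut` and `table`. The vector `rates` holds made-up values for illustration, and the bin edges are an assumption chosen to mirror bins of width 2.5 starting at 5; this is not a description of how `gf_histogram` works internally.

```r
# Made-up interest rates (percent), for illustration only
rates <- c(5.31, 6.08, 7.35, 7.96, 9.43, 9.92, 10.42, 12.62, 17.09, 24.85)

# Bin edges every 2.5 percentage points, starting at 5
edges <- seq(5, 30, by = 2.5)

# cut() assigns each value to a bin; table() counts the values in each bin.
# right = TRUE places a value that sits exactly on an edge in the lower bin.
bins   <- cut(rates, breaks = edges, right = TRUE, include.lowest = TRUE)
counts <- table(bins)
counts
```

The bar heights of a histogram are exactly these per-bin counts, so changing `edges` changes the picture even though the data stay the same.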
---

## Shape of a Distribution

A histogram can suggest whether our data are **skewed**:

<img src="index_files/figure-html/eg_skew-1.png" style="display: block; margin: auto;" />

---

## Multimodal Distributions

A distribution with one peak is called unimodal, a distribution with two peaks is called bimodal, and a distribution with more than two peaks is called multimodal.

--

<img src="index_files/figure-html/eg_modals-1.png" style="display: block; margin: auto;" />

---

## Summary

In this lecture, we covered the following topics:

- Descriptive statistics: mean, median, variance, and standard deviation; and

- we introduced dot plots, histograms, and boxplots for visualizing numerical data.

--

- R commands for computing mean, median, variance, and standard deviation were introduced; and

- we saw one way to obtain a boxplot or a histogram using R.

--

- We also introduced the concept of outliers and robust statistics.

---

## Next Time

Before next time, watch this video corresponding to textbook section 2.2.
---

## Notes