class: center, middle, inverse, title-slide .title[ # MATH 204 Introduction to Statistics ] .subtitle[ ## Lecture 6: Examining Categorical Data ] .author[ ### JMG ] --- ## Goals for Lecture * Introduce graphical and numerical summaries for categorical data (textbook section 2.2). -- * Define count and proportion tables. -- * Introduce bar plots and mosaic plots. -- * Define and discuss key features of a distribution for a categorical variable. -- * Introduce grouped data and box plots. --- ## Categorical Data Video On your own time, watch this video corresponding to textbook section 2.2.
--- ## Categorical Data Recall that [categorical data](https://en.wikipedia.org/wiki/Categorical_variable) corresponds to observations that are qualitative properties or features. -- - Categorical data can be further classified as being binary, nominal, or ordinal. -- - We call the possible values of a categorical variable the **levels** of the variable. Another name for a categorical variable is **factor**. --- ## Example - Recalling the animal shelter data set from lecture 1, <table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> animal_id </th> <th style="text-align:left;"> animal </th> <th style="text-align:left;"> mf </th> <th style="text-align:right;"> age </th> <th style="text-align:left;"> name </th> <th style="text-align:left;"> outcome </th> <th style="text-align:right;"> stay </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> A087526 </td> <td style="text-align:left;"> Dog </td> <td style="text-align:left;"> Male </td> <td style="text-align:right;"> 1.0 </td> <td style="text-align:left;"> Gizmo </td> <td style="text-align:left;"> Adoption </td> <td style="text-align:right;"> 6 </td> </tr> <tr> <td style="text-align:left;"> A04354 </td> <td style="text-align:left;"> Dog </td> <td style="text-align:left;"> Male </td> <td style="text-align:right;"> 2.0 </td> <td style="text-align:left;"> Quentin </td> <td style="text-align:left;"> Adoption </td> <td style="text-align:right;"> 66 </td> </tr> <tr> <td style="text-align:left;"> A033375 </td> <td style="text-align:left;"> Cat </td> <td style="text-align:left;"> Male </td> <td style="text-align:right;"> 7.0 </td> <td style="text-align:left;"> Artemis </td> <td style="text-align:left;"> Return to Owner </td> <td style="text-align:right;"> 2 </td> </tr> <tr> <td style="text-align:left;"> A081213 </td> <td style="text-align:left;"> Cat </td> <td style="text-align:left;"> Male </td> <td style="text-align:right;"> 0.1 </td> 
<td style="text-align:left;"> *Birch </td> <td style="text-align:left;"> Transfer </td> <td style="text-align:right;"> 3 </td> </tr> <tr> <td style="text-align:left;"> A095836 </td> <td style="text-align:left;"> Cat </td> <td style="text-align:left;"> Female </td> <td style="text-align:right;"> 0.1 </td> <td style="text-align:left;"> *Liza </td> <td style="text-align:left;"> Adoption </td> <td style="text-align:right;"> 65 </td> </tr> <tr> <td style="text-align:left;"> A065244 </td> <td style="text-align:left;"> Dog </td> <td style="text-align:left;"> Male </td> <td style="text-align:right;"> 10.0 </td> <td style="text-align:left;"> Star </td> <td style="text-align:left;"> Transfer </td> <td style="text-align:right;"> 4 </td> </tr> </tbody> </table> -- - The variable `animal` is a factor with levels "Dog" and "Cat". -- - The variable `name` is also a factor, but it is very difficult (maybe impossible) to list all of its possible levels. --- ## Tables - Tables are a straightforward way to summarize categorical data when there aren't too many levels. -- - For example, in the animal shelter data: .panelset[ .panel[.panel-name[Table] ``` ## ## Cat Dog ## 17635 30774 ``` ] .panel[.panel-name[R Code] ```r table(shelter_data$animal) ``` ] ] -- - The `table` function returns the count of how many times each level of a factor appears in the data. This is sometimes called a frequency table. --- ## Bar Plots - A bar plot is the visual analog of a frequency table. For example, .panelset[ .panel[.panel-name[Bar Plot] <img src="index_files/figure-html/eg-bar-1.png" style="display: block; margin: auto;" /> ] .panel[.panel-name[R Code] ```r gf_bar(~animal,data=shelter_data) ``` ] ] --- ## Proportions Tables and bar plots display absolute counts; sometimes it is more useful to know proportions. -- - For example, `\(\frac{17635}{17635+30774}\approx 0.36\)` is the proportion of observations in the animal shelter data whose `animal` value is Cat. 
-- - R can compute such proportions for us: ```r prop.table(table(shelter_data$animal)) ``` ``` ## ## Cat Dog ## 0.3642918 0.6357082 ``` -- - Note that the proportions in such a table must add to 1. --- ## Proportions Plot - We can plot a proportions table similarly to a bar plot. For example, -- .panelset[ .panel[.panel-name[Prop Plot] <img src="index_files/figure-html/eg-prop-1.png" style="display: block; margin: auto;" /> ] .panel[.panel-name[R Code] ```r gf_props(~animal,data=shelter_data) ``` ] ] --- ## Two Categorical Variables Scatterplots provide a way to summarize two numerical variables together. Methods for summarizing two categorical variables together include: -- - Contingency tables -- - Proportion tables -- - Stacked or side-by-side bar plots -- - Mosaic plots -- We will explain each of these tools and illustrate how to obtain them in R. We will work with the `loans_dat` data set. This data set represents thousands of loans made through Lending Club, a platform that allows individuals to lend to other individuals. --- ## Contingency Tables Contingency tables display the number of times a particular combination of variable outcomes occurs. -- - For example, we construct a contingency table for the variables `homeownership` (ownership status of the applicant's residence) and `application_type` (type of application: either individual or joint): -- .panelset[ .panel[.panel-name[Table] ``` ## homeownership ## application_type MORTGAGE OWN RENT Sum ## individual 3839 1170 3496 8505 ## joint 950 183 362 1495 ## Sum 4789 1353 3858 10000 ``` ] .panel[.panel-name[R Code] ```r with(loans_dat,addmargins(table(application_type,homeownership))) ``` ] ] -- - If we create a bar plot for `homeownership`, we will see bars corresponding to the first three values in the last row of our table. Similarly, a bar plot for `application_type` will show bars corresponding to the first two values in the last column of our table. 
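The link between the margins of a contingency table and the one-variable counts can be checked directly. The following is a minimal sketch using a small made-up data frame, `toy_loans` (a hypothetical stand-in, not the actual `loans_dat`), so the specific counts are illustrative only.

```r
# Hypothetical toy data; these values are made up and are NOT from loans_dat
toy_loans <- data.frame(
  application_type = c("individual", "individual", "joint",
                       "individual", "joint", "individual"),
  homeownership    = c("RENT", "MORTGAGE", "MORTGAGE",
                       "OWN", "RENT", "RENT")
)

# Two-way contingency table of counts
tab <- with(toy_loans, table(application_type, homeownership))

# Margins: row sums count application_type, column sums count homeownership
row_totals <- rowSums(tab)
col_totals <- colSums(tab)

# One-variable frequency tables (the counts a bar plot would display)
one_way_type <- table(toy_loans$application_type)
one_way_home <- table(toy_loans$homeownership)

# Display the table with its margins, as on the slide
addmargins(tab)
```

Comparing `row_totals` with `one_way_type`, and `col_totals` with `one_way_home`, shows that the margins carry the same counts as the single-variable tables.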
--- ## Bar Plots for `loans_dat` <img src="index_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" /> <table class="table" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> MORTGAGE </th> <th style="text-align:right;"> OWN </th> <th style="text-align:right;"> RENT </th> <th style="text-align:right;"> Sum </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> individual </td> <td style="text-align:right;"> 3839 </td> <td style="text-align:right;"> 1170 </td> <td style="text-align:right;"> 3496 </td> <td style="text-align:right;"> 8505 </td> </tr> <tr> <td style="text-align:left;"> joint </td> <td style="text-align:right;"> 950 </td> <td style="text-align:right;"> 183 </td> <td style="text-align:right;"> 362 </td> <td style="text-align:right;"> 1495 </td> </tr> <tr> <td style="text-align:left;"> Sum </td> <td style="text-align:right;"> 4789 </td> <td style="text-align:right;"> 1353 </td> <td style="text-align:right;"> 3858 </td> <td style="text-align:right;"> 10000 </td> </tr> </tbody> </table> --- ## Proportion Tables A proportion table displays the same essential information as a contingency table except we divide entries by either the row sums (row proportion table) or the column sums (column proportion table). 
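As a sketch of how such tables might be computed in base R, `prop.table()` takes a `margin` argument: `margin = 1` divides each entry by its row sum, and `margin = 2` divides by its column sum. The counts below are copied from the contingency table shown earlier; the object names `counts`, `row_props`, and `col_props` are our own choices for illustration.

```r
# Counts copied from the loans_dat contingency table (Sum margins left out)
counts <- matrix(
  c(3839, 1170, 3496,
     950,  183,  362),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    application_type = c("individual", "joint"),
    homeownership    = c("MORTGAGE", "OWN", "RENT")
  )
)

# Row proportions: entries divided by row sums, so each row adds to 1
row_props <- prop.table(counts, margin = 1)

# Column proportions: entries divided by column sums, so each column adds to 1
col_props <- prop.table(counts, margin = 2)

round(row_props, 4)
round(col_props, 4)
```

With the actual data, the same idea would presumably apply to `with(loans_dat, table(application_type, homeownership))` in place of `counts`.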
-- .pull-left[ - Row proportion table <table> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> MORTGAGE </th> <th style="text-align:right;"> OWN </th> <th style="text-align:right;"> RENT </th> <th style="text-align:right;"> Sum </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> individual </td> <td style="text-align:right;"> 0.4513815 </td> <td style="text-align:right;"> 0.1375661 </td> <td style="text-align:right;"> 0.4110523 </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:left;"> joint </td> <td style="text-align:right;"> 0.6354515 </td> <td style="text-align:right;"> 0.1224080 </td> <td style="text-align:right;"> 0.2421405 </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:left;"> Sum </td> <td style="text-align:right;"> 1.0868330 </td> <td style="text-align:right;"> 0.2599742 </td> <td style="text-align:right;"> 0.6531928 </td> <td style="text-align:right;"> 2 </td> </tr> </tbody> </table> ] .pull-right[ - Column proportion table <table> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> MORTGAGE </th> <th style="text-align:right;"> OWN </th> <th style="text-align:right;"> RENT </th> <th style="text-align:right;"> Sum </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> individual </td> <td style="text-align:right;"> 0.8016287 </td> <td style="text-align:right;"> 0.864745 </td> <td style="text-align:right;"> 0.906169 </td> <td style="text-align:right;"> 2.5725427 </td> </tr> <tr> <td style="text-align:left;"> joint </td> <td style="text-align:right;"> 0.1983713 </td> <td style="text-align:right;"> 0.135255 </td> <td style="text-align:right;"> 0.093831 </td> <td style="text-align:right;"> 0.4274573 </td> </tr> <tr> <td style="text-align:left;"> Sum </td> <td style="text-align:right;"> 1.0000000 </td> <td style="text-align:right;"> 1.000000 </td> <td style="text-align:right;"> 1.000000 </td> <td style="text-align:right;"> 
3.0000000 </td> </tr> </tbody> </table> ] --- ## Stacked and Side-By-Side Bar Plots <img src="index_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" /> --- ## Standardized Stacked Bar Plot Stacked bar plots can be used to visualize a proportion table. -- - For example, the following stacked bar plot displays our column proportion table as a plot: .pull-left[ <img src="index_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" /> ] .pull-right[ <table> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> MORTGAGE </th> <th style="text-align:right;"> OWN </th> <th style="text-align:right;"> RENT </th> <th style="text-align:right;"> Sum </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> individual </td> <td style="text-align:right;"> 0.8016287 </td> <td style="text-align:right;"> 0.864745 </td> <td style="text-align:right;"> 0.906169 </td> <td style="text-align:right;"> 2.5725427 </td> </tr> <tr> <td style="text-align:left;"> joint </td> <td style="text-align:right;"> 0.1983713 </td> <td style="text-align:right;"> 0.135255 </td> <td style="text-align:right;"> 0.093831 </td> <td style="text-align:right;"> 0.4274573 </td> </tr> <tr> <td style="text-align:left;"> Sum </td> <td style="text-align:right;"> 1.0000000 </td> <td style="text-align:right;"> 1.000000 </td> <td style="text-align:right;"> 1.000000 </td> <td style="text-align:right;"> 3.0000000 </td> </tr> </tbody> </table> ] --- ## Mosaic Plots A **mosaic plot** is a visualization of a contingency table. Mosaic plots can display one variable or several. <img src="index_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" /> --- ## Grouped Data - Grouped numerical data arises when we want to study the distribution of a numerical variable across two or more distinct groups. 
-- - In other words, we are looking for an association between two variables where one variable is numerical (typically viewed as the response variable) and the other is categorical (typically viewed as the explanatory variable). -- - For example, in the animal shelter data, we might be interested to know whether there is a difference in how long animals remain in the shelter when we compare cats against dogs. -- - The next slide shows side-by-side box plots for the length of stay for cats and dogs. --- ## Grouped Summaries .pull-left[ .panelset[ .panel[.panel-name[Grouped Plot] <!-- --> ] .panel[.panel-name[R Code] ```r gf_boxplot(stay~animal,data=shelter_data) ``` ] ] ] .pull-right[ .panelset[ .panel[.panel-name[Grouped Median] <table class="table" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> animal </th> <th style="text-align:right;"> median_stay </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Cat </td> <td style="text-align:right;"> 23 </td> </tr> <tr> <td style="text-align:left;"> Dog </td> <td style="text-align:right;"> 6 </td> </tr> </tbody> </table> ] .panel[.panel-name[R Code] ```r shelter_data %>% group_by(animal) %>% summarise(median_stay=median(stay)) ``` ] ] ] --- ## Summary In this lecture, we covered the topics of -- - Graphical and numerical summaries for categorical data -- - Contingency tables, bar plots, and mosaic plots -- - Grouped data and grouped summaries --- ## Next Time In the next lecture, we will begin our discussion of probability, which forms the foundation of statistics. In preparation, you are encouraged to watch the included video.
--- ## Notes