class: center, middle, inverse, title-slide

.title[
# MATH 204 Introduction to Statistics
]
.subtitle[
## Lecture 1: Intro to Data
]
.author[
### JMG
]

---

## Goals for Lecture

* Introduce the overall structure of the course. See the [syllabus](https://introductiontostatistics.netlify.app/about.html) for important details regarding course logistics.

--

* The required textbook for the course is the free [OpenIntro Statistics 4th ed.](https://www.openintro.org/book/os/)

--

* Introduce basic data concepts (textbook Chapter 1):

--

  * Data tables (textbook section 1.2)

--

  * Data classification (textbook section 1.2)

---

## Course Overview

The course is divided broadly into four components:

1) Data, Sampling, Numerical summaries, and Visual summaries (Chapters 1-2)

--

2) Probability and Random Variables (Chapters 3-4)

--

3) Basic statistical inference (Chapters 5-7)

--

4) Linear models (Chapters 8-9)

--

Throughout the course, we will make use of the [R](https://www.r-project.org/) statistical computing environment, and other parts of the R ecosystem such as the [RStudio](https://www.rstudio.com/) [integrated development environment](https://en.wikipedia.org/wiki/Integrated_development_environment) (IDE).

.pull-right[
<img src="https://www.dropbox.com/s/bl4u8njjco8cj0n/exploder.gif?raw=1" width="125" height="125" />
]

---

## Intro to R

To get an early start with R, see this [blog](https://jennysloane.netlify.app/blog/intro_to_r/) by [Jenny Sloane](https://jennysloane.netlify.app/#about) or the corresponding video:
---

## Data Basics

Statistics is about data. Chapter 1 of the textbook introduces the fundamental ideas related to collecting and organizing data.

--

* The term "data" can be interpreted very broadly. In this class, we focus on data that is collected and structured for use in statistical analysis.

--

* For most studies that use statistical analyses, data collection and organization should be well thought out and carefully structured.

--

* We emphasize data that can be organized into a table or matrix, where each row corresponds to a single case or **observation**, and each column corresponds to a particular **variable**.

--

* A variable is a particular characteristic that can be measured or observed for an observational unit.

--

Let's look at example data to help motivate our discussion.

---

## Data Example: Animal Shelter Data

.left-column[
<img src="https://www.dropbox.com/s/q29jr963etgg307/Ziggy.jpeg?raw=1" width="100%" />
]

.right-column[
The [Austin Animal Center](https://www.austintexas.gov/austin-animal-center) (AAC), a “no kill” animal shelter in Austin, Texas, records data on the intake and outcomes of animals that make it to the shelter.
]

--

.right-column[
* The data includes information about each animal and the animal's outcome.
]

--

.right-column[
* Information about each animal includes characteristics of the animal, as well as when it arrived at the shelter, how long it remained there, and what eventually happened to it.
]

--

.right-column[
* The next slide shows a portion of the [animal shelter data](https://data.austintexas.gov/browse).
]

---

## Part of Animal Shelter Data

.panelset[
.panel[.panel-name[Data portion]

<table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> animal_id </th>
   <th style="text-align:left;"> animal </th>
   <th style="text-align:left;"> mf </th>
   <th style="text-align:right;"> age </th>
   <th style="text-align:left;"> name </th>
   <th style="text-align:left;"> outcome </th>
   <th style="text-align:right;"> stay </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> A087526 </td>
   <td style="text-align:left;"> Dog </td>
   <td style="text-align:left;"> Male </td>
   <td style="text-align:right;"> 1.0 </td>
   <td style="text-align:left;"> Gizmo </td>
   <td style="text-align:left;"> Adoption </td>
   <td style="text-align:right;"> 6 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> A04354 </td>
   <td style="text-align:left;"> Dog </td>
   <td style="text-align:left;"> Male </td>
   <td style="text-align:right;"> 2.0 </td>
   <td style="text-align:left;"> Quentin </td>
   <td style="text-align:left;"> Adoption </td>
   <td style="text-align:right;"> 66 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> A033375 </td>
   <td style="text-align:left;"> Cat </td>
   <td style="text-align:left;"> Male </td>
   <td style="text-align:right;"> 7.0 </td>
   <td style="text-align:left;"> Artemis </td>
   <td style="text-align:left;"> Return to Owner </td>
   <td style="text-align:right;"> 2 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> A081213 </td>
   <td style="text-align:left;"> Cat </td>
   <td style="text-align:left;"> Male </td>
   <td style="text-align:right;"> 0.1 </td>
   <td style="text-align:left;"> *Birch </td>
   <td style="text-align:left;"> Transfer </td>
   <td style="text-align:right;"> 3 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> A095836 </td>
   <td style="text-align:left;"> Cat </td>
   <td style="text-align:left;"> Female </td>
   <td style="text-align:right;"> 0.1 </td>
   <td style="text-align:left;"> *Liza </td>
   <td style="text-align:left;"> Adoption </td>
   <td 
style="text-align:right;"> 65 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> A065244 </td>
   <td style="text-align:left;"> Dog </td>
   <td style="text-align:left;"> Male </td>
   <td style="text-align:right;"> 10.0 </td>
   <td style="text-align:left;"> Star </td>
   <td style="text-align:left;"> Transfer </td>
   <td style="text-align:right;"> 4 </td>
  </tr>
</tbody>
</table>
]

.panel[.panel-name[R Code]

```r
library(tidyverse)   # read_csv(), glimpse(), and the %>% pipe
library(kableExtra)  # kbl() and kable_styling()

shelter_data <- read_csv("AnimalShelterData.csv") # read in data
# display a few rows and columns of data
head(shelter_data[, c(1, 4, 5, 6, 14, 2, 3)]) %>%
  kbl() %>%
  kable_styling(full_width = FALSE)
```
]
]

Note that each row corresponds to exactly one animal (observation), and each column corresponds to a particular feature (variable).

--

Let's get some additional info about the animal shelter data.

---

## More on Animal Shelter Data

.panelset[
.panel[.panel-name[Data glimpse]

```
## Rows: 48,409
## Columns: 16
## $ animal_id  <chr> "A087526", "A04354", "A033375", "A081213", "A095836", "A065…
## $ outcome    <chr> "Adoption", "Adoption", "Return to Owner", "Transfer", "Ado…
## $ stay       <dbl> 6, 66, 2, 3, 65, 4, 0, 1, 42, 16, 63, 16, 36, 3, 29, 2, 30,…
## $ animal     <chr> "Dog", "Dog", "Cat", "Cat", "Cat", "Dog", "Dog", "Dog", "Do…
## $ mf         <chr> "Male", "Male", "Male", "Male", "Female", "Male", "Female",…
## $ age        <dbl> 1.0, 2.0, 7.0, 0.1, 0.1, 10.0, 1.0, 8.0, 10.0, 3.0, 0.0, 6.…
## $ in_month   <dbl> 7, 3, 1, 8, 7, 10, 2, 7, 1, 12, 2, 3, 10, 7, 4, 4, 8, 6, 2,…
## $ in_year    <dbl> 2018, 2020, 2017, 2019, 2016, 2017, 2018, 2017, 2020, 2016,…
## $ out_month  <dbl> 7, 5, 1, 8, 9, 10, 2, 7, 3, 12, 4, 3, 11, 7, 5, 4, 9, 6, 2,…
## $ out_year   <dbl> 2018, 2020, 2017, 2019, 2016, 2017, 2018, 2017, 2020, 2016,…
## $ in_reason  <chr> "Stray", "Owner Surrender", "Stray", "Stray", "Stray", "Own…
## $ in_intact  <dbl> 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1,…
## $ out_intact <dbl> 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ name       <chr> "Gizmo", "Quentin", "Artemis", "*Birch", "*Liza", "Star", "…
## $ breed      <chr> "Chihuahua Shorthair Mix", "American Foxhound/Labrador Retr…
## $ color      <chr> "White/Brown", "White/Brown", "Blue/White", "Brown Tabby", …
```
]

.panel[.panel-name[R Code]

```r
glimpse(shelter_data)
```
]
]

---

## Reflection on Animal Shelter Data

* The output on the previous slide shows that the animal shelter data has 48,409 rows and 16 columns.

--

* That is, there are 48,409 observations and 16 variables.

--

* Let's discuss what each variable in the data set represents.

---

## Variables in Animal Shelter Data

* Each animal that comes to the shelter is assigned a unique ID.

--

* For each animal that enters the shelter, its type (dog or cat), sex, age, name, breed, and color are recorded. The data also lists whether the animal is fixed when it enters and when it leaves the shelter (the `in_intact` and `out_intact` variables).

--

* For each animal, the month and year it arrives at and leaves the shelter are recorded, along with how long it stays and the outcome of its stay.

---

## Questions from Animal Shelter Data

What questions do you think we could try to answer using the animal shelter data?

---

## Types of Variables

In the animal shelter data, we see that some variables like `age` have values that are **quantitative** (numeric) while others like `animal` have values that are **qualitative** (categorical).

--

* It is useful to distinguish certain common variable types.

--

* As we will see throughout the course, the type of a variable will determine the methods used to analyze the data.

--

* The next slide displays a diagram that summarizes the common variable types.

--

**Warning:** Just because numbers are used to record a feature of an observation doesn't mean the variable is quantitative!

--

* For example, the animal shelter data uses the numbers 1 to 12 to represent the month an animal enters or leaves the shelter, but month is not really quantitative: the numbers are labels for categories.
---

## Classification of Variable Types

.center[
<img src="https://www.dropbox.com/s/iak7celnfkh7ss9/variables.png?raw=1" width="75%" />
]

--

Let's elaborate on the categorical variable concept.

---

## Categorical Variable Types

.center[
<img src="https://www.dropbox.com/s/wdwvadz4hvf7fn7/nominal_ordinal_binary.png?raw=1" width="75%" />
]

---

## Data Basics Lecture Video

On your own time, watch this video corresponding to the textbook section 1.2.
- **Question:** Which variable(s) in the `county` data set discussed in the video are discrete, and why?

---

## Animal Shelter Variable Types

Let's list the type for each variable in the animal shelter data.

.pull-left[
.can-edit.key-tryedit[
- `animal_id`
- `outcome`
- `stay`
- `animal`
- `mf`
- `age`
- `in_month`
- `in_year`
]
]

.pull-right[
.can-edit.key-tryedit[
- `out_month`
- `out_year`
- `in_reason`
- `in_intact`
- `out_intact`
- `name`
- `breed`
- `color`
]
]

--

* **Note:** Unique ids such as `animal_id` are typically not treated as variables.

---

## Further Examples

There is a data package associated with the textbook called `openintro`.

--

* One of the data sets in the `openintro` package is the `satgpa` data set, which records SAT and GPA data for 1000 students at an unnamed college.

--

* The data records the sex of the student; verbal, math, and total SAT percentiles; high school GPA; and first year college GPA.

--

* The first few rows of the data are displayed on the next slide.

---

## SAT and GPA Data

.panelset[
.panel[.panel-name[Data portion]

<table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:right;"> sex </th>
   <th style="text-align:right;"> sat_v </th>
   <th style="text-align:right;"> sat_m </th>
   <th style="text-align:right;"> sat_sum </th>
   <th style="text-align:right;"> hs_gpa </th>
   <th style="text-align:right;"> fy_gpa </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 65 </td>
   <td style="text-align:right;"> 62 </td>
   <td style="text-align:right;"> 127 </td>
   <td style="text-align:right;"> 3.40 </td>
   <td style="text-align:right;"> 3.18 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 58 </td>
   <td style="text-align:right;"> 64 </td>
   <td style="text-align:right;"> 122 </td>
   <td style="text-align:right;"> 4.00 </td>
   <td style="text-align:right;"> 3.33 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 56 </td>
   <td style="text-align:right;"> 60 </td>
   <td style="text-align:right;"> 116 </td>
   <td style="text-align:right;"> 3.75 </td>
   <td style="text-align:right;"> 3.25 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 42 </td>
   <td style="text-align:right;"> 53 </td>
   <td style="text-align:right;"> 95 </td>
   <td style="text-align:right;"> 3.75 </td>
   <td style="text-align:right;"> 2.42 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 55 </td>
   <td style="text-align:right;"> 52 </td>
   <td style="text-align:right;"> 107 </td>
   <td style="text-align:right;"> 4.00 </td>
   <td style="text-align:right;"> 2.63 </td>
  </tr>
</tbody>
</table>
]

.panel[.panel-name[R Code]

```r
library(kableExtra)  # kbl(), kable_styling(), and the %>% pipe

# display a few rows of data
head(openintro::satgpa, 5) %>%
  kbl() %>%
  kable_styling(full_width = FALSE)
```
]
]

--

**Question:** What do you think is the type for each variable?

---

## Relationships Between Variables

* Sometimes we are interested in only a single feature or variable. For example, in the animal shelter data, maybe we just want to get a sense of how long an animal spends in the shelter.

--

* On the other hand, it is often more interesting (meaningful) to consider how features or variables relate.

--

* A pair of variables is either related in some way (**associated**) or not (**independent**). No pair of variables is both associated and independent.

--

* If two associated variables increase together, they are said to have a **positive association**. If one variable decreases as the other increases, they are said to have a **negative association**.

--

* Do you think there are any associations between variables in the animal shelter data?

--

* **Important:** Always keep in mind that, in general, **association does not imply causation.**

---

## Explanatory and Response Variables

Sometimes we are interested in a particular kind of relationship between variables.
--

* For example, related to the animal shelter data, we might ask "are dogs more or less likely to stay longer in the shelter than cats?" That is, could the type of animal help to explain the length of time an animal stays in the shelter?

--

* In the context of the example, the type of animal is an **explanatory** variable while the length of stay in the shelter is a **response** variable.

---

## Summary

In this lecture, we covered

1) Data matrices, also known as data tables or data frames. These store data in rows and columns, with each row corresponding to an observation and each column to a variable.

2) Data types: numerical vs. categorical with a further breakdown depending on additional specifics.

3) Association between variables, and the notion of explanatory and response variables.

---

## Next Time

In some ways, we have gotten ahead of ourselves.

--

* Research or studies typically begin with a problem or question. Then, data is collected, possibly followed by a statistical analysis.

--

* While this course doesn't focus on developing research questions, we do spend some time on certain concepts regarding data collection as it relates to statistics.

--

* In the next lecture, we cover sampling principles and contrast experimental vs. observational studies.

--

* To get a head start, watch the videos embedded in the next two slides.

---

## Data Collection Principles Video
---

## Sampling Strategies Video

---

## Notes

---

## Notes

---

## Notes