class: center, middle, inverse, title-slide

.title[
# MATH 204 Introduction to Statistics
]
.subtitle[
## Lecture 2: Sampling
]
.author[
### JMG
]

---

## Goals for Lecture

* Introduce basic concepts related to data collection and sampling (textbook Chapter 1):

--

  * Observational vs. experimental studies (textbook sections 1.2.5, 1.3.4, and 1.4)
  * Populations and samples (textbook section 1.3.1)
  * Issue of bias (textbook section 1.3.3)
  * Sampling strategies (textbook section 1.3.5)

---

## Research Methods

* The first step in conducting research is to identify the topics or questions to be investigated.

--

* The next step is to collect data.

--

* Good research practice requires careful consideration of how data is collected.

--

* It is important to distinguish observational studies from experimental studies.

--

* In an **observational study**, the data collection process does not interfere with the subjects of the study.

--

* An **experimental study** involves a manipulation of the study subjects.

---

## Observational Studies

* Suppose that we want to know what University of Scranton students do or do not like about the buildings on campus. We could survey current students with questions that ask for ratings on different aspects of campus buildings.

--

* The data collected through such a survey is an example of **observational data**. Why?

--

* Making causal conclusions based on observational data is **not** recommended.

---

## Experimental Studies

* Suppose that we want to know if caffeine consumption influences the exam performance of University of Scranton students. To study this, we give all of the students in one section of BIOL 141 a certain dose of caffeine at the mid-term exam, ask all of the students in another section of BIOL 141 to refrain from consuming caffeine before the mid-term exam, and record the mid-term exam scores of both sections.

--

* This is an example of **experimental data**. Why?
--

* Note that there is an **experimental group** (*i.e.*, those consuming caffeine) and a **control group** (*i.e.*, those refraining from caffeine).

--

* In an experimental study there are often both explanatory and response variables.

--

.can-edit.key-tryedit[
* In our example, the mid-term exam score is the (which) variable while caffeine consumption is the (which) variable.
]

---

## Populations

* In the context of statistics (and research that intends to use statistical methods for data analysis), a [**population**](https://en.wikipedia.org/wiki/Statistical_population) is the overall group (which may be real or hypothetical) that is the focus of a research question.

--

* In our examples for both an observational and an experimental study, our research question is about the population of students at the University of Scranton.

--

* Populations tend to be very large because if a population is small, then each member of the population can be observed directly and (inferential) statistics is not needed.

--

* Typically, it is somewhere between inconvenient and impossible to collect data for every case or member in a population.

--

* As examples, think about the observational study on UofS building preferences and the experimental study on UofS student exam performance and caffeine consumption. In either case, would it be feasible to survey or observe every student?

---

## Samples

* Researchers use statistics to analyze data collected from a **sample** to make conclusions about a population from which the data were sampled.

--

* Here are some examples:

--

  * A factory makes a lot of items (the population) but randomly selects a few of those items to test for quality.

--

  * Data from current patients in a hospital (a sample) is used to make conclusions about future patients (the population).

--

* Why would data from a laboratory experiment typically be a sample?
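--

The factory example above can be sketched in a few lines of Python. This is only an illustration under made-up assumptions: the population size (10,000 items), the defect probability (about 3%), the sample size (200), and all variable names are hypothetical and do not come from the textbook.

```python
import random

# Fixed seed so the sketch is reproducible (the seed value is arbitrary).
rng = random.Random(204)

# Hypothetical population: 10,000 items, each defective with probability 0.03.
# True marks a defective item, False a good one.
population = [rng.random() < 0.03 for _ in range(10_000)]

# Randomly select 200 items (without replacement) to test for quality.
sample = rng.sample(population, 200)

# Proportion of defective items in the whole population and in the sample.
population_rate = sum(population) / len(population)
sample_rate = sum(sample) / len(sample)

print(f"population defect rate: {population_rate:.3f}")
print(f"sample defect rate:     {sample_rate:.3f}")
```

In practice the factory cannot compute `population_rate`; it only sees `sample_rate` and uses it to draw a conclusion about the full production run.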
---

## Bias

* One common goal of statistics is to use a sample to make generalizations about the population to which the sample belongs.

--

* One would expect that if such a goal is to be achieved, then the sample needs to be **representative** of the population.

--

* A sample that for one reason or another fails to be representative of the target population is said to be a **biased** sample.

--

* Suppose that we want to know about the majors of students at the University of Scranton who take a statistics course. We conduct a study that involves asking every student enrolled in MATH 204 in Fall 2022 what their major is. Is this a good strategy? Why or why not?

--

* In sampling for statistical purposes, one should always seek to **randomly** select a sample from a population in order to reduce the risk of bias.

---

## Sampling Techniques

* We describe four techniques for random sampling.

--

  * Simple random sampling - selects sample members individually at random, like drawing names from a hat.

--

  * Stratified random sampling - groups similar individuals into strata and then randomly samples from each stratum. For example, group students into 4 cohorts, then randomly select 10 students from each cohort.

--

  * Cluster sampling - data are binned into clusters and then a sample of clusters is randomly chosen. For example, treat all of the different classes running at the University of Scranton in Fall 2022 as the clusters, randomly choose ten of these classes, and include every student in the chosen classes.

--

  * Multistage sampling - data are binned into clusters, a sample of clusters is randomly chosen, then a sample of individuals from each chosen cluster is randomly selected. For example, treat all of the different classes running at the University of Scranton in Fall 2022 as the clusters, randomly choose ten of these classes, and finally select at random three students from each of the ten chosen classes.

--

* The next four slides provide visual representations of each sampling method.
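--

The four techniques can also be sketched in code. The following is a minimal Python sketch using only the standard library, not a definitive implementation. The roster is invented for illustration (400 hypothetical students, 4 cohorts, 40 classes of 10 students each), and the `Student` type and all function names are assumptions made for this example.

```python
import random
from collections import defaultdict, namedtuple

# Hypothetical roster: student i belongs to one cohort and one class.
Student = namedtuple("Student", ["id", "cohort", "class_id"])
roster = [Student(i, i % 4 + 1, i // 10) for i in range(400)]

def group_by(units, key):
    """Bin units into groups (strata or clusters) according to key(unit)."""
    groups = defaultdict(list)
    for unit in units:
        groups[key(unit)].append(unit)
    return groups

def simple_random_sample(units, n, rng):
    """Draw n units individually at random, without replacement."""
    return rng.sample(units, n)

def stratified_sample(units, key, n_per_stratum, rng):
    """Randomly draw n_per_stratum units from every stratum."""
    sample = []
    for members in group_by(units, key).values():
        sample.extend(rng.sample(members, n_per_stratum))
    return sample

def cluster_sample(units, key, n_clusters, rng):
    """Randomly choose n_clusters clusters and keep every unit in them."""
    clusters = group_by(units, key)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [unit for c in chosen for unit in clusters[c]]

def multistage_sample(units, key, n_clusters, n_per_cluster, rng):
    """Randomly choose clusters, then randomly draw units within each one."""
    clusters = group_by(units, key)
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = []
    for c in chosen:
        sample.extend(rng.sample(clusters[c], n_per_cluster))
    return sample

rng = random.Random(2022)  # arbitrary seed, for reproducibility only

srs = simple_random_sample(roster, 40, rng)
strat = stratified_sample(roster, lambda s: s.cohort, 10, rng)
clus = cluster_sample(roster, lambda s: s.class_id, 10, rng)
multi = multistage_sample(roster, lambda s: s.class_id, 10, 3, rng)

print(len(srs), len(strat), len(clus), len(multi))
```

The calls mirror the examples on this slide: 10 students from each of 4 cohorts (stratified), all students from 10 randomly chosen classes (cluster), and 3 students from each of 10 randomly chosen classes (multistage).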
---

## Simple Random Sampling

<img src="index_files/figure-html/srs-plot-1.png" width="100%" height="600" />

---

## Stratified Random Sampling

<img src="index_files/figure-html/stratified-plot-1.png" width="100%" height="600" />

---

## Cluster Sampling

<img src="index_files/figure-html/cluster-plot-1.png" width="100%" height="600" />

---

## Multistage Sampling

<img src="index_files/figure-html/multistage-plot-1.png" width="100%" height="600" />

---

## Which Sampling?

* How do we decide which sampling method to use?

--

* It is unlikely that any sampling method will be perfect for a particular situation. Always keep in mind that the goal is to reduce bias and obtain a sample that is as representative as possible.

--

* Of course, there are sampling methods besides those we have discussed here. The point is to be aware of the problems that arise in sampling and to think through a study design in order to minimize these problems as much as possible.

---

## Summary

In this lecture, we introduced the essential concepts of data collection:

--

* observational vs. experimental data

--

* populations and samples

--

* bias and sampling techniques

---

## Next Time

* In the next lecture, we introduce the basics of R that will be used in Chapter 2 on summarizing data.

--

* Good references for R as we will use it in this course include:

--

  * The [swirl](https://swirlstats.com/) course on [R programming](https://github.com/swirldev/R_Programming_E).
  * The [Intro to R blog](https://jennysloane.netlify.app/blog/intro_to_r/) by [Jenny Sloane](https://jennysloane.netlify.app/#about).

.center[
<img src="https://www.dropbox.com/s/bl4u8njjco8cj0n/exploder.gif?raw=1" width="200" height="200" />
]

---

## Notes

---

## Notes

---

## Notes