CourseKata Chapter 2.8–2.11

Sampling • Tidy Data • Missing Data • Creating Variables

Mansour Abdoli, PhD

Today (Ch. 2.8–2.11)

Session goals

By the end of today, you can:

  • Explain population vs. sample
  • Explain why random + independent sampling matters
  • Describe tidy data (rows/columns) and work with data frames
  • Recognize and handle missing data (NA) in a principled way
  • Create and recode variables to answer a question

2.8 Sampling from a population

Big ideas

  • Population: the universe of cases we want to learn about
  • Sample: the cases we actually measure
  • Two key choices in any sampling plan:
    1. Who counts as part of the population?
    2. How do we select cases from that population?

Random & independent sampling

  • Random: every case in the population has an equal chance of being selected
  • Independent: picking one case doesn’t change the chance of picking another

Why we care:

  • Many statistical methods assume your sample is (approximately) random and independent.
  • When these assumptions fail, conclusions can be misleading.

Try it: sampling variation with a “fake population”
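A minimal sketch of one way to build a fake population in base R. The name fake_pop, the die-roll setup, and the seed are assumptions for illustration, not necessarily what the CourseKata pages use.

```r
set.seed(42)  # arbitrary seed so the run is repeatable

# Hypothetical "fake population": 1,000 rolls of a fair six-sided die
fake_pop <- sample(1:6, size = 1000, replace = TRUE)

mean(fake_pop)   # population mean; should land near 3.5
table(fake_pop)  # count of each face; roughly equal counts expected
```

Because we built the population ourselves, we know what a representative sample should look like.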

Now take a random sample of 10 and compare.
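One possible way to draw the sample, using base R's sample(). The population is rebuilt here so the block runs on its own; fake_pop and my_sample are hypothetical names.

```r
set.seed(42)
# Hypothetical population: 1,000 simulated die rolls
fake_pop <- sample(1:6, size = 1000, replace = TRUE)

# Draw 10 cases at random (without replacement, the default)
my_sample <- sample(fake_pop, size = 10)

mean(fake_pop)   # population mean
mean(my_sample)  # sample mean; usually close, rarely identical
```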

Visual: histogram of your sample (n = 10)
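A sketch of the plot using base R's hist(). CourseKata materials typically draw histograms with gf_histogram(); base graphics is swapped in here so the example needs no extra packages. Object names are hypothetical.

```r
set.seed(42)
fake_pop  <- sample(1:6, size = 1000, replace = TRUE)  # hypothetical population
my_sample <- sample(fake_pop, size = 10)               # random sample, n = 10

# One bar per die face: breaks sit halfway between the integers
hist(my_sample,
     breaks = seq(0.5, 6.5, by = 1),
     main   = "Random sample (n = 10)",
     xlab   = "Value")
```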

Think, Pair, Share:

  • What do you expect if we repeat the sampling again?
  • What happens if we increase the sample size?
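To check your predictions, here is a small simulation sketch. It repeats the sampling 500 times at two sample sizes and compares how much the sample means vary. The helper sample_mean() and the choice of 500 repetitions are assumptions for illustration.

```r
set.seed(42)
fake_pop <- sample(1:6, size = 1000, replace = TRUE)  # hypothetical population

# Mean of one fresh random sample of size n
sample_mean <- function(n) mean(sample(fake_pop, size = n))

# Repeat the sampling 500 times at each sample size
means_n10  <- replicate(500, sample_mean(10))
means_n100 <- replicate(500, sample_mean(100))

sd(means_n10)   # spread of sample means when n = 10
sd(means_n100)  # spread when n = 100; expected to be smaller
```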

Discuss: Which sample looks “more representative” of the population? Why?

2.9 The structure of data (tidy data)

The tidy-data idea

A useful default structure:

  1. Each column is a variable
  2. Each row is a case (observation)
  3. Each type of case belongs in its own table

Meet a CourseKata data frame: Fingers
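A first look at a data frame usually starts with head() and str(). The sketch below uses a tiny stand-in, fingers_toy, with made-up values and only a few of the columns this deck mentions. With the coursekata package loaded, the same calls can be run on Fingers itself.

```r
# Hypothetical stand-in for Fingers; all values are invented
fingers_toy <- data.frame(
  Gender = c("female", "male", "female", "male", "female", "female"),
  Thumb  = c(58, 66, 61, 64, 57, 60),
  Index  = c(70, 75, 72, 78, 69, 71),
  Ring   = c(69, 79, 71, 80, 70, 70),
  SSLast = c(3, NA, 7, 1, NA, 9)
)

head(fingers_toy)  # first rows: each row is one case
str(fingers_toy)   # one line per column: each column is a variable
nrow(fingers_toy)  # number of cases
</gr_insert_placeholder>
```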

Think:

  • What are the cases?
  • What are the variables?

Selecting a few columns (CourseKata-style)
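A base R sketch on a made-up stand-in for Fingers. In CourseKata the same step is typically written with select(), shown in the comment. The names fingers_toy and fingers_small are hypothetical.

```r
# Hypothetical stand-in for Fingers; all values are invented
fingers_toy <- data.frame(
  Gender = c("female", "male", "female", "male"),
  Thumb  = c(58, 66, 61, 64),
  Index  = c(70, 75, 72, 78)
)

# Keep only the Gender and Thumb columns
# (with dplyr loaded: select(fingers_toy, Gender, Thumb))
fingers_small <- fingers_toy[, c("Gender", "Thumb")]
fingers_small
```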

Often we only want a few rows, too:
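For example, head() with a row count limits the printout. This is a sketch on made-up data; fingers_toy and first_three are hypothetical names.

```r
# Hypothetical stand-in for Fingers; all values are invented
fingers_toy <- data.frame(
  Gender = c("female", "male", "female", "male", "female"),
  Thumb  = c(58, 66, 61, 64, 57),
  Index  = c(70, 75, 72, 78, 69)
)

# First 3 rows of the two selected columns
first_three <- head(fingers_toy[, c("Gender", "Thumb")], 3)
first_three
```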

Filtering rows (subsetting cases)

Example: keep only Gender == "female".
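A sketch using base R's subset() on a made-up stand-in for Fingers. With dplyr loaded, filter() does the same job, as noted in the comment. The names fingers_toy and females are hypothetical.

```r
# Hypothetical stand-in for Fingers; all values are invented
fingers_toy <- data.frame(
  Gender = c("female", "male", "female", "male", "female", "female"),
  Thumb  = c(58, 66, 61, 64, 57, 60)
)

# Keep only the rows where Gender is "female"
# (with dplyr loaded: filter(fingers_toy, Gender == "female"))
females <- subset(fingers_toy, Gender == "female")

nrow(females)  # number of cases that passed the filter
```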

Question: How many females are in this sample?

2.10 Missing data

What is missing data in R?

  • R uses the special value NA (“not available”) to represent missingness.
  • NA is not the same as the text "NA".

Identify missing values
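A short sketch of the usual tools: is.na() flags missing values, and sum() counts them. The data frame fingers_toy is a made-up stand-in for Fingers.

```r
# Hypothetical stand-in for Fingers; all values are invented
fingers_toy <- data.frame(
  Gender = c("female", "male", "female", "male", "female"),
  Thumb  = c(58, 66, NA, 64, 57),
  SSLast = c(3, NA, 7, 1, NA)
)

is.na(fingers_toy$SSLast)       # TRUE wherever SSLast is missing
sum(is.na(fingers_toy$SSLast))  # how many values are missing in SSLast
colSums(is.na(fingers_toy))     # missing count for every column

is.na("NA")  # FALSE: the text "NA" is an ordinary string, not missing
```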

Two common strategies

  1. Drop rows with missing values (sometimes OK, sometimes not)
  2. Filter only the rows you need (more targeted)
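The two strategies can be compared side by side. This sketch uses a made-up stand-in for Fingers in which both Thumb and SSLast have gaps; dropped_all and targeted are hypothetical names.

```r
# Hypothetical stand-in for Fingers; all values are invented
fingers_toy <- data.frame(
  Gender = c("female", "male", "female", "male", "female", "female"),
  Thumb  = c(58, 66, NA, 64, 57, 60),
  SSLast = c(3, NA, 7, 1, NA, 9)
)

# Strategy 1: drop every row that has an NA in ANY column
dropped_all <- na.omit(fingers_toy)

# Strategy 2: drop only the rows where SSLast is missing
# (with dplyr loaded: filter(fingers_toy, !is.na(SSLast)))
targeted <- fingers_toy[!is.na(fingers_toy$SSLast), ]

nrow(dropped_all)  # rows left after dropping on any NA
nrow(targeted)     # rows left after dropping on SSLast only
```

The targeted version keeps the case whose only gap is in Thumb, a variable this particular analysis may not need.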

Activity: “missingness decisions” (group work)

In groups of 3–4:

  • Imagine SSLast is important to your analysis.
  • Decide which approach you’d use:
    • na.omit() on the whole data frame
    • targeted filter() for just SSLast
    • something else

Be ready to justify your choice in 1–2 sentences.

2.11 Creating and recoding variables

Creating a summary variable (logical)

CourseKata example idea: “Is ring finger longer than index finger?”
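A minimal sketch of that comparison on made-up data. The new column name RingLonger is an assumption for illustration.

```r
# Hypothetical stand-in for Fingers; all values are invented
fingers_toy <- data.frame(
  Index = c(70, 75, 72, 78, 69, 71),
  Ring  = c(69, 79, 71, 80, 70, 70)
)

# The comparison runs row by row and returns TRUE or FALSE for each case
fingers_toy$RingLonger <- fingers_toy$Ring > fingers_toy$Index

fingers_toy
sum(fingers_toy$RingLonger)  # TRUE counts as 1, so this counts the TRUE cases
```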

Creating a ratio variable
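A sketch of an index-to-ring ratio on made-up data. The column name IndexRingRatio, and the choice to put Index in the numerator, are assumptions for illustration.

```r
# Hypothetical stand-in for Fingers; all values are invented
fingers_toy <- data.frame(
  Index = c(70, 75, 72, 78, 69, 71),
  Ring  = c(69, 79, 71, 80, 70, 70)
)

# Division is applied row by row: a ratio above 1 means Index is longer
fingers_toy$IndexRingRatio <- fingers_toy$Index / fingers_toy$Ring

fingers_toy
```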

Creating an average across multiple variables
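A sketch that averages three finger lengths per case on made-up data. The column name AvgFinger is hypothetical, and only three finger columns are used to keep the example small.

```r
# Hypothetical stand-in for Fingers; all values are invented
fingers_toy <- data.frame(
  Thumb = c(58, 66, 61, 64),
  Index = c(70, 75, 72, 78),
  Ring  = c(69, 79, 71, 80)
)

# Add the columns row by row, then divide by the number of columns
fingers_toy$AvgFinger <-
  (fingers_toy$Thumb + fingers_toy$Index + fingers_toy$Ring) / 3

fingers_toy
```

Base R's rowMeans() on the same three columns gives the same result and scales better to many columns.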

Recoding: turn a quantitative variable into categories

Example: make a “HighRatio” indicator (above 1 vs not).
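One way to sketch this recode on made-up data is a direct comparison, which yields a TRUE/FALSE indicator. The column names IndexRingRatio and HighRatio are assumptions for illustration.

```r
# Hypothetical stand-in for Fingers; all values are invented
fingers_toy <- data.frame(
  Index = c(70, 75, 72, 78, 69, 71),
  Ring  = c(69, 79, 71, 80, 70, 70)
)

# Quantitative variable: index-to-ring ratio
fingers_toy$IndexRingRatio <- fingers_toy$Index / fingers_toy$Ring

# Categorical recode: TRUE when the ratio is above 1, FALSE otherwise
fingers_toy$HighRatio <- fingers_toy$IndexRingRatio > 1

fingers_toy
```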

Now cross-tab by Gender:
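A sketch of the cross-tabulation using base R's table(). CourseKata materials typically use tally() for this; table() is swapped in so the example needs no extra packages. The data and the HighRatio column are made up.

```r
# Hypothetical stand-in for Fingers; all values are invented
fingers_toy <- data.frame(
  Gender = c("female", "male", "female", "male", "female", "female"),
  Index  = c(70, 75, 72, 78, 69, 71),
  Ring   = c(69, 79, 71, 80, 70, 70)
)

# Indicator: is the index-to-ring ratio above 1?
fingers_toy$HighRatio <- (fingers_toy$Index / fingers_toy$Ring) > 1

# Rows are Gender, columns are HighRatio; each cell is a count of cases
ratio_by_gender <- table(Gender    = fingers_toy$Gender,
                         HighRatio = fingers_toy$HighRatio)
ratio_by_gender
```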

Wrap-up: what we just practiced

  • Sampling → variability shows up even when the population is “known”
  • Tidy data → rows = cases, columns = variables
  • Missing data → NA requires explicit handling
  • New variables → make your analysis easier (and your models possible)

Exit ticket (2–3 minutes)

Write a short response:

  1. One sentence: define population vs sample
  2. One sentence: what makes data “tidy”?
  3. One R line: remove rows where SSLast is NA (don’t use na.omit())