CourseKata Chapter 2.8–2.11

Sampling • Tidy Data • Missing Data • Creating Variables

Mansour Abdoli, PhD

Today (Ch. 2.8–2.11)

Session goals

By the end of today, you can:

Explain population vs. sample
Explain why random + independent sampling matters
Describe tidy data (rows/columns) and work with data frames
Recognize and handle missing data (NA) in a principled way
Create and recode variables to answer a question

2.8 Sampling from a population

Big ideas

Population: the universe of cases we want to learn about
Sample: the cases we actually measure
Two key choices in any sampling plan:
1. Who counts as part of the population?
2. How do we select cases from that population?

Random & independent sampling

Random: every case in the population has an equal chance of being selected
Independent: picking one case doesn’t change the chance of picking another

Why we care:

Many statistical methods assume your sample is (approximately) random and independent.
When these assumptions fail, conclusions can be misleading.

Try it: sampling variation with a “fake population”

Now take a random sample of 10 cases and compare it with the population.
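
A minimal sketch of this activity in base R. The population is invented for illustration (10,000 simulated values), and the names `fake_pop` and `my_sample`, the seed, and the mean/SD are assumptions, not values from the CourseKata materials.

```r
# Hypothetical "fake population": 10,000 simulated heights in cm.
set.seed(42)  # arbitrary seed so the run is repeatable
fake_pop <- rnorm(10000, mean = 170, sd = 10)

# Simple random sample of 10 cases, drawn without replacement.
my_sample <- sample(fake_pop, size = 10)

# Compare the sample mean with the population mean.
pop_mean    <- mean(fake_pop)
sample_mean <- mean(my_sample)
c(population = pop_mean, sample = sample_mean)
```

Calling `sample()` again without resetting the seed draws a different 10 cases, so the sample mean moves around while the population mean stays fixed. That movement is sampling variation.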

Visual: histogram of your sample (n = 10)
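
One way to draw this histogram, sketched with base R's `hist()` so it runs without extra packages (CourseKata pages usually use `gf_histogram()` instead). The simulated population and all object names are assumptions made for illustration.

```r
# Hypothetical fake population and a random sample of 10 cases.
set.seed(1)  # arbitrary seed so the picture is repeatable
fake_pop  <- rnorm(10000, mean = 170, sd = 10)
my_sample <- sample(fake_pop, size = 10)

# Histogram of the 10 sampled values; hist() also returns the bin counts.
h <- hist(my_sample,
          main = "Random sample (n = 10)",
          xlab = "Simulated value")
h$counts
```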

2.9 The structure of data (tidy data)

The tidy-data idea

A useful default structure:

Each column is a variable
Each row is a case (observation)
Each type of case belongs in its own table

Meet a `coursekata` data frame: `Fingers`

Think:
- What are the cases?
- What are the variables?
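
A sketch of how one might take a first look at a data frame. To stay runnable without the `coursekata` package, it uses a tiny stand-in called `fingers_toy` (a hypothetical name with invented values); with the package loaded, the same calls would be applied to `Fingers`.

```r
# Toy stand-in for Fingers: each row is one student (case),
# each column is one variable. All values are invented.
fingers_toy <- data.frame(
  Gender = c("female", "male", "female", "male", "female"),
  Thumb  = c(58, 66, 61, 70, 55),
  Index  = c(68, 75, 70, 78, 66),
  Ring   = c(70, 79, 69, 80, 68),
  SSLast = c(3, NA, 7, 1, NA)
)

str(fingers_toy)      # variable names and types
head(fingers_toy, 3)  # first three cases
n_cases <- nrow(fingers_toy)   # number of cases
var_names <- names(fingers_toy) # names of the variables
```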

Selecting a few columns (CourseKata-style)

Often we only want a few rows, too:
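
A minimal sketch of both steps in base R on a hypothetical stand-in data frame (`fingers_toy`, invented values). With `coursekata` loaded, the column step is usually written with `select()`, for example `select(Fingers, Gender, Thumb)`.

```r
# Toy stand-in for Fingers (invented values).
fingers_toy <- data.frame(
  Gender = c("female", "male", "female", "male", "female"),
  Thumb  = c(58, 66, 61, 70, 55),
  Index  = c(68, 75, 70, 78, 66)
)

# Keep only two columns (variables).
two_cols <- fingers_toy[, c("Gender", "Thumb")]

# Look at just the first three rows (cases).
first_three <- head(two_cols, 3)
first_three
```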

Filtering rows (subsetting cases)

Example: keep only Gender == "female".
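
A sketch of this filter using base R's `subset()` on a hypothetical stand-in data frame (`fingers_toy`, invented values). With `coursekata` loaded, the equivalent call would be `filter(Fingers, Gender == "female")`.

```r
# Toy stand-in for Fingers (invented values).
fingers_toy <- data.frame(
  Gender = c("female", "male", "female", "male", "female"),
  Thumb  = c(58, 66, 61, 70, 55)
)

# Keep only the cases where Gender is "female".
females <- subset(fingers_toy, Gender == "female")

# Count the cases that remain.
n_females <- nrow(females)
n_females
```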

Question: How many females are in this sample?

2.10 Missing data

What is missing data in R?

R uses the special value NA (“not available”) to represent missingness.

NA is not the same as the text "NA".
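
A small sketch of the difference, using two made-up vectors (`x` and `y` are illustrative names):

```r
x <- c(1, NA, 3)        # the middle value is truly missing
y <- c("a", "NA", "c")  # the middle value is the two-letter text "NA"

is.na(x)  # FALSE  TRUE FALSE
is.na(y)  # FALSE FALSE FALSE

# A missing value propagates through calculations unless it is removed.
mean(x)                # NA
mean(x, na.rm = TRUE)  # 2
```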

Identify missing values
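
A sketch of how missing values might be located and counted, using a hypothetical stand-in data frame (`fingers_toy`, invented values) in place of `Fingers`:

```r
# Toy stand-in for Fingers with a few missing values (invented).
fingers_toy <- data.frame(
  Gender = c("female", "male", "female", "male", "female"),
  Thumb  = c(58, 66, NA, 70, 55),
  SSLast = c(3, NA, 7, 1, NA)
)

# TRUE marks the cases where SSLast is missing.
is.na(fingers_toy$SSLast)

# Count the missing values in one variable.
n_missing <- sum(is.na(fingers_toy$SSLast))

# Count the missing values in every variable at once.
missing_by_col <- colSums(is.na(fingers_toy))
missing_by_col
```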

Two common strategies

Drop rows with missing values (sometimes OK, sometimes not)
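
A sketch of this strategy with `na.omit()` on a hypothetical stand-in data frame (`fingers_toy`, invented values). Note how many cases disappear: a row is dropped if any of its variables is missing.

```r
# Toy stand-in for Fingers with a few missing values (invented).
fingers_toy <- data.frame(
  Gender = c("female", "male", "female", "male", "female"),
  Thumb  = c(58, 66, NA, 70, 55),
  SSLast = c(3, NA, 7, 1, NA)
)

# Drop every row that has an NA in any column.
complete_cases <- na.omit(fingers_toy)

n_before <- nrow(fingers_toy)     # 5 cases
n_after  <- nrow(complete_cases)  # 2 cases remain
```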

Filter only the rows you need (more targeted)
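
A sketch of the targeted approach in base R, again on a hypothetical stand-in data frame (`fingers_toy`, invented values). Only rows missing `SSLast` are removed; a row missing some other variable is kept.

```r
# Toy stand-in for Fingers with a few missing values (invented).
fingers_toy <- data.frame(
  Gender = c("female", "male", "female", "male", "female"),
  Thumb  = c(58, 66, NA, 70, 55),
  SSLast = c(3, NA, 7, 1, NA)
)

# Keep only the cases where SSLast is not missing.
has_sslast <- subset(fingers_toy, !is.na(SSLast))

n_kept <- nrow(has_sslast)  # 3 cases; one of them still has a missing Thumb
```

Compared with dropping all incomplete rows, this keeps more cases, which matters when the other variables are not part of the analysis.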

Activity: “missingness decisions” (group work)

In groups of 3–4:

Imagine SSLast is important to your analysis.
Decide which approach you’d use:
- na.omit() on the whole data frame
- targeted filter() for just SSLast
- something else

Be ready to justify your choice in 1–2 sentences.

2.11 Creating and recoding variables

Creating a summary variable (logical)

CourseKata example idea: “Is ring finger longer than index finger?”
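
A minimal sketch of that idea on a hypothetical stand-in data frame (`fingers_toy`, invented values). The new column name `RingLonger` is an assumption chosen for illustration.

```r
# Toy stand-in for Fingers (invented finger lengths in mm).
fingers_toy <- data.frame(
  Index = c(68, 75, 70),
  Ring  = c(70, 74, 70)
)

# A comparison returns TRUE/FALSE for each case; store it as a new variable.
fingers_toy$RingLonger <- fingers_toy$Ring > fingers_toy$Index
fingers_toy$RingLonger  # TRUE FALSE FALSE
```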

Creating a ratio variable
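
A sketch of a ratio variable on a hypothetical stand-in data frame (`fingers_toy`, invented values). Both the name `RingIndexRatio` and the direction of the ratio (ring divided by index) are assumptions made for illustration.

```r
# Toy stand-in for Fingers (invented finger lengths in mm).
fingers_toy <- data.frame(
  Index = c(50, 80, 70),
  Ring  = c(60, 72, 70)
)

# Divide one variable by another, case by case.
fingers_toy$RingIndexRatio <- fingers_toy$Ring / fingers_toy$Index
fingers_toy$RingIndexRatio  # 1.2 0.9 1.0
```

A ratio above 1 means the ring finger is longer than the index finger for that case.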

Creating an average across multiple variables
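
A sketch of averaging several variables within each case, on a hypothetical stand-in data frame (`fingers_toy`, invented values). The column name `AvgFinger` is an assumption.

```r
# Toy stand-in for Fingers (invented finger lengths in mm).
fingers_toy <- data.frame(
  Thumb  = c(60, 50),
  Index  = c(70, 60),
  Middle = c(80, 70),
  Ring   = c(75, 65),
  Pinkie = c(65, 55)
)

# Add the five lengths for each case and divide by 5.
fingers_toy$AvgFinger <- (fingers_toy$Thumb + fingers_toy$Index +
                          fingers_toy$Middle + fingers_toy$Ring +
                          fingers_toy$Pinkie) / 5
fingers_toy$AvgFinger  # 70 60

# rowMeans() computes the same per-case average from the five columns.
avg_check <- rowMeans(fingers_toy[, c("Thumb", "Index", "Middle", "Ring", "Pinkie")])
```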

Recoding: turn a quantitative variable into categories

Example: make a “HighRatio” indicator (above 1 vs not).
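
A sketch of this recode on a hypothetical stand-in data frame (`fingers_toy`, invented values). The ratio column `RingIndexRatio` and the labels "high" and "not high" are assumptions made for illustration.

```r
# Toy stand-in for Fingers with an already-computed ratio (invented values).
fingers_toy <- data.frame(
  RingIndexRatio = c(1.2, 0.9, 1.0, 1.1)
)

# Logical indicator: TRUE when the ratio is above 1 (exactly 1 counts as not high).
fingers_toy$HighRatio <- fingers_toy$RingIndexRatio > 1

# The same recode with readable category labels.
fingers_toy$RatioGroup <- ifelse(fingers_toy$HighRatio, "high", "not high")
fingers_toy
```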

Now cross-tab by Gender:
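
A sketch of the cross-tabulation with base R's `table()` on a hypothetical stand-in data frame (`fingers_toy`, invented values). CourseKata pages typically use `tally()` for this; `table()` is used here so the example runs without extra packages.

```r
# Toy stand-in for Fingers (invented values).
fingers_toy <- data.frame(
  Gender         = c("female", "male", "female", "male"),
  RingIndexRatio = c(1.2, 0.9, 1.0, 1.1)
)
fingers_toy$HighRatio <- fingers_toy$RingIndexRatio > 1

# Rows are Gender, columns are HighRatio; each cell counts cases.
tab <- table(Gender = fingers_toy$Gender, HighRatio = fingers_toy$HighRatio)
tab
```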

Wrap-up: what we just practiced

Sampling → variability shows up even when the population is “known”
Tidy data → rows = cases, columns = variables
Missing data → NA requires explicit handling
New variables → make your analysis easier (and your models possible)

Exit ticket (2–3 minutes)

Write a short response:

One sentence: define population vs. sample
One sentence: what makes data “tidy”?
One R line: remove rows where SSLast is NA (don’t use na.omit())