CourseKata Chapter 3.10–3.13

The DGP • Sampling Variation • Simulations

Mansour Abdoli, PhD

Today (Ch. 3.10–3.13)

Session goals

By the end of today, you can:

Explain what a Data Generating Process (DGP) is
Use bottom‑up vs top‑down thinking to connect data ↔︎ DGP
Simulate a DGP with sample() / resample() and visualize results
Explain sampling variation and why small samples can look “weird”
Use simulations to see the law of large numbers in action

3.10 The Data Generating Process (DGP)

Data Generating Process (DGP)

Distributions Represent Variability in Data

Known Population
- Fixed Variability

A Small Sample
- Can look very different from the population

A Large Sample
- Still different from the population, but less so

Variation depends in part on the DGP
- Key question: “What DGP generated the variability in the population?”

Quick example: bus wait times (source)

Distribution of some 60,000 waiting times at a NY bus stop

DGP thinking = using context to explain the distribution:
- schedules?
- passenger behavior?
- delays, traffic, bunching?

3.11 Back-and-forth: data ↔︎ DGP

Two moves statisticians make

Top‑down: Theory of DGP → Expected Distribution

Bottom‑up: Observed Distribution → Inferred Population/DGP
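The two moves can be sketched in base R. This is only an illustration: the fair six-sided die used as the DGP, the sample size of 60, and the seed are all assumptions made for the example, not part of the slide.

```r
# Assumed DGP for illustration: a fair six-sided die.
faces <- 1:6

# Top-down: theory of the DGP -> expected distribution.
# A fair die predicts equal proportions for every face.
expected_prop <- rep(1 / 6, length(faces))
names(expected_prop) <- faces

# Bottom-up: observed distribution -> guess about the DGP.
set.seed(1)  # arbitrary seed so the run is repeatable
rolls <- sample(faces, size = 60, replace = TRUE)
observed_prop <- table(factor(rolls, levels = faces)) / length(rolls)

print(round(expected_prop, 3))
print(round(observed_prop, 3))
```

Comparing the two printouts is the back-and-forth: the expected proportions come from the theory, and the observed proportions are what someone would have to reason from if the die were unknown.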

Sampling and Resampling

Sampling 1 Roll of a Die

Pair prompt:
- Why do we get different numbers?
- What stays the same?
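A minimal sketch for exploring the pair prompt, using base R's sample(). The object names (die, one_roll, five_rolls) are hypothetical, and the die is assumed to be fair.

```r
# Model population: the six faces of a die (assumed fair).
die <- 1:6

# Draw a single roll; rerun this line to see the result change.
one_roll <- sample(die, size = 1)
print(one_roll)

# Repeat the one-roll experiment 5 times.
five_rolls <- replicate(5, sample(die, size = 1))
print(five_rolls)
```

No seed is set here on purpose, so each run can give different numbers while the code that generates them stays exactly the same.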

3.12 From DGP → population → samples

DGP: Independent Samples

Sample with replacement (Independent Samples)

What happens if replace = FALSE (the default for sample())?
How closely does the sample distribution resemble the population?
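The questions above can be tried out with a short base R sketch. The fair die, the sample size of 10, and the seed are assumptions for illustration. The slides also mention resample(); as far as I know it samples with replacement by default, but only base R's sample() is used here so the block runs without extra packages.

```r
die <- 1:6
set.seed(10)  # arbitrary seed so the run is repeatable

# Independent draws: with replacement, every roll has the same six options.
rolls_10 <- sample(die, size = 10, replace = TRUE)
print(table(factor(rolls_10, levels = die)))

# Without replacement, base R cannot take 10 draws from only 6 values.
# tryCatch() captures the error message instead of stopping the script.
no_replace <- tryCatch(
  sample(die, size = 10, replace = FALSE),
  error = function(e) conditionMessage(e)
)
print(no_replace)

# Without replacement and size 6, the result is just a shuffle of the faces.
shuffle <- sample(die, size = 6, replace = FALSE)
print(shuffle)
```

Sampling without replacement makes each draw depend on the earlier ones (a face that has been drawn cannot come up again), which is why it does not model independent rolls.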

DGP: Large Independent Samples

A sample of 1000 independent draws:

Which one (10 or 1000) looks more like the population?
Which one has more sampling variation?
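One way to put numbers on the comparison, assuming a fair die as the population. The helper prop() and the "largest gap from 1/6" summary are hypothetical choices made for this sketch, not something the slides define.

```r
die <- 1:6
set.seed(2024)  # arbitrary seed so the run is repeatable

small <- sample(die, size = 10, replace = TRUE)
large <- sample(die, size = 1000, replace = TRUE)

# Proportion of rolls landing on each face (faces never rolled count as 0).
prop <- function(x) table(factor(x, levels = die)) / length(x)
small_prop <- prop(small)
large_prop <- prop(large)

print(round(small_prop, 3))
print(round(large_prop, 3))

# Largest distance between any face's proportion and the population value 1/6.
small_gap <- max(abs(small_prop - 1 / 6))
large_gap <- max(abs(large_prop - 1 / 6))
print(c(small_gap = small_gap, large_gap = large_gap))
```

Rerunning with different seeds shows how much the size-10 proportions jump around from sample to sample compared with the size-1000 proportions.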

3.13 Weird DGPs and their samples

Weird populations can exist

This is a strange population shape, but it’s still a valid distribution.
Could a sample represent this weird distribution?

Sampling from the Weird Population

Which sample “reveals” the W-shape better?
Which DGP is more variable?
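A sketch of sampling from a W-shaped population. The actual population on the slide is not recoverable from the text, so the values 1 to 9 and the weights below are invented purely to produce a W shape (peaks at 1, 5, and 9); the sample sizes and seed are also assumptions.

```r
# Hypothetical W-shaped population: peaks at 1, 5, 9 and dips at 3, 7.
values  <- 1:9
weights <- c(5, 3, 1, 3, 5, 3, 1, 3, 5)
pop_prop <- weights / sum(weights)

set.seed(313)  # arbitrary seed so the run is repeatable

# Independent draws, with each value chosen according to pop_prop.
small_w <- sample(values, size = 20,   replace = TRUE, prob = pop_prop)
large_w <- sample(values, size = 2000, replace = TRUE, prob = pop_prop)

# Counts per value; compare the pattern of highs and lows with pop_prop.
print(round(pop_prop, 3))
print(table(factor(small_w, levels = values)))
print(table(factor(large_w, levels = values)))
```

Both samples come from the same DGP here; only the sample size differs, which is worth keeping in mind when deciding which one shows the W more clearly.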

Law of Large Numbers

As sample size increases:
- the sample distribution tends to resemble the population distribution more closely.
A single small sample could also resemble the population, but it is less likely to:
- Small samples can be misleading.
The Law of Large Numbers works better for statistics than for distributions! Why?
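A small simulation of the law of large numbers applied to one statistic, the sample mean. The fair die, the particular sample sizes, and the seed are assumptions for illustration; the mean is just one example of a statistic.

```r
die <- 1:6
pop_mean <- mean(die)  # population mean of a fair die

set.seed(97)  # arbitrary seed so the run is repeatable
sizes <- c(10, 100, 1000, 10000)

# For each sample size, roll that many times and record the sample mean.
sample_means <- sapply(sizes, function(n) {
  mean(sample(die, size = n, replace = TRUE))
})

result <- data.frame(
  n           = sizes,
  sample_mean = sample_means,
  gap         = abs(sample_means - pop_mean)  # distance from the population mean
)
print(result)
```

The gap column tracks a single number per sample, whereas matching the whole distribution requires all six proportions to settle at once; that contrast may be a useful starting point for the "Why?" question.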