CourseKata Chapter 3.10–3.13

The DGP • Sampling Variation • Simulations

Mansour Abdoli, PhD

Today (Ch. 3.10–3.13)

Session goals

By the end of today, you can:

  • Explain what Data Generating Process (DGP) means.
  • Use bottom‑up vs top‑down thinking to connect data ↔︎ DGP
  • Simulate a DGP with sample() / resample() and visualize results
  • Explain sampling variation and why small samples can look “weird”
  • Use simulations to see the law of large numbers in action

3.10 The Data Generating Process (DGP)

Data Generating Proccess (DGP)

  • Distributions Represents Variability in Data

  • Known Population
    • Fixed Variability
  • A Small Sample
    • Much different
  • A Large Sample;
    • Different, but less
  • Variation in part depends on the DGP
    • Key Idea: “What DGP generated population variability?”

Quick example: bus wait times (source)

distribution of some 60,000 waiting times at a NY bus stop
  • DGP thinking =
    using context to explain the distribution:
    • schedules?
    • passenger behavior?
    • delays, traffic, bunching?

3.11 Back-and-forth: data ↔︎ DGP

Two moves statisticians make

  • Top‑down: Theory of DGP → Expected Distribution

  • Bottom‑up: Observed Distribution → Predict Population/DGP

Sampling and Resampling

  • Sampling 1 Roll of a Die
  • Pair prompt:
    • Why do we get different numbers?
    • What stays the same?

3.12 From DGP → population → samples

GDP: Independent Samples

  • Sample with replacement (Independet Samples)
  • What happens if replace = FALSE (default)?
  • How the distribution resembles the population?

GDP: Large Independent Samples

  • 1000 independent samples:
  • Which one (10 or 1000) looks more like the population?
  • Which one has more sampling variation?

3.13 Weird DGPs and their samples

Weird populations can exist

  • This is a strange population shape -— but it’s still a valid distribution.

  • Could a sample represent this weird distribution?

Sampling from the Weird Population

  • Which sample “reveals” the W-shape better?
  • Which DGP is more variable?

Law of Large Numbers

  • As sample size increases:
    • the sample distribution tends to resemble the population distribution more closely.
  • A single small sample could do the same but less likely to do so:
    • Small samples can be misleading.
  • Law of Large Numbers works better on Statistics than Distribution! Why?