CourseKata Chapter 5

Simple Model: Prediction + Residual

Mansour Abdoli, PhD

Today: Representing Distributions with 1 Number

Session Goals

By the end of today, you can:

  • Explain error in mathematical & statistical models
  • Choose a statistic to represent a distribution
  • Compute and interpret residuals
  • Use lm() to build a simple model based on the mean
  • Use predict() and resid() to evaluate the performance of a model

Mathematical vs. Statistical Estimation

Mathematical Modeling

Assume we need to estimate the area of a circle:

A circle with a set of inner and an outer squares

Using squares

A circle with a set of inner and an outer octagons

Using octagons
  • Area = Estimate + Error
  • We can control the error
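
The idea of a controllable error can be sketched numerically. This is a minimal illustration, assuming the figures show regular polygons inscribed in and circumscribed about the circle; the function names inner_area, outer_area, and bounds are hypothetical helpers, not part of the course code. For a unit circle the inner square has area 2 and the outer square has area 4, and the gap shrinks as sides are added.

```r
# Area of a regular n-gon inscribed in a circle of radius r
inner_area <- function(n, r = 1) (n / 2) * r^2 * sin(2 * pi / n)

# Area of a regular n-gon circumscribed about the same circle
outer_area <- function(n, r = 1) n * r^2 * tan(pi / n)

# Estimate = midpoint of the two bounds; the true area lies between
# the bounds, so it is within max_error of the estimate
bounds <- function(n, r = 1) {
  lo <- inner_area(n, r)
  hi <- outer_area(n, r)
  c(n = n, lower = lo, upper = hi,
    estimate = (lo + hi) / 2,
    max_error = (hi - lo) / 2)
}

# Squares, octagons, and a 64-sided polygon for comparison
rbind(bounds(4), bounds(8), bounds(64))
pi  # true area when r = 1
```

Choosing a polygon with more sides is how the error is "controlled" here: the bound on the error is known in advance and can be made as small as needed.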

Statistical Modeling of Distributions

Students’ Height:

  • Two Important questions:
    • How to estimate the distribution?
    • How to predict future samples?
  • Answer:
    1. Model the data-generating process (DGP).
    2. Estimate the model parameters.
    3. Control the error as much as possible.

A Simple DGP Model

Students’ Height:

Slightly right-skewed, centered around 65, taking values between 55 and 80

 min   Q1 median   Q3  max
  59 63.5   65.5 68.1 76.5
     mean       sd   n missing
 65.94682 3.547939 157       0
  • DGP: Data (Height) = Model + Error
  • Estimating Model (Center):
    • Mode (Most likely): \((62.5, 67.5]\)
    • Median (Middle number): \(65.5\)
    • Mean (Center of mass): \(65.95\)

Evaluating A DGP Model

  • DGP: Data (Height) = Model + Error
  • Estimating Model (Center):
    • Mode (Most likely): \((62.5, 67.5]\)
    • Median (Middle number): \(65.5\)
    • Mean (Center of mass): \(65.95\)
  • Estimating Error (Residual)
    • Mode:
      • Residual\(=1_{[Height==Mode]} \in \{0,1\}\)
    • Median & Mean:
      • Residual\(=Height - Center\)
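
The three centers and their residuals can be computed along these lines. This is only a sketch: the height vector below is hypothetical (it is NOT the class data summarized above), and the 5-inch bins are an assumption chosen to mirror the \((62.5, 67.5]\) interval.

```r
# Hypothetical heights in inches (NOT the class data)
height <- c(61, 63, 64, 65, 65, 66, 67, 68, 70, 74)

# Mode as the most frequent bin, using assumed 5-inch bins
bins <- cut(height, breaks = seq(57.5, 77.5, by = 5))
modal_bin <- names(which.max(table(bins)))

center_median <- median(height)
center_mean   <- mean(height)

# Mode model: record only whether each value falls in the modal bin (1) or not (0)
match_mode <- as.integer(bins == modal_bin)

# Median and mean models: residual = observed value - center
resid_median <- height - center_median
resid_mean   <- height - center_mean

data.frame(height, match_mode, resid_median, resid_mean)
```

The mode model only says whether a prediction hit or missed, while the median and mean models also say by how much and in which direction.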

Long-run Behavior

Hands on Simulation of Simple Models

A Toy Problem:

Predicting the next random selection from: \(2, 2, 2, 4, 6, 12\)

  • Form a Team (of 3 or 4)
  • Create a table with 15-20 rows:
 #  Guess  Actual  Match (0/1)  Actual-Guess
 1  -      -       -            -
 2  -      -       -            -
  • Decide on a predicting procedure:

\[\begin{array}{ll} - \text{Mode} & - \text{Median}\\ - \text{Mean} & - \text{Random Guess} \end{array}\]

A Toy Problem (continued):

Predicting the next random selection from: \(2, 2, 2, 4, 6, 12\)

Repeat 15-20 times:


  • Guess the next value
  • Record it under Guess
  • Roll a die (real or digital)
  • Use it to pick the next observation.
  • Record the observation under Actual.
  • Compute the Match and Actual-Guess columns.
  • How good was your guessing procedure?
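
The paper exercise can also be run on a computer. The following is a minimal sketch, assuming each die face picks one of the six listed values; the seed, the number of rounds, and the score helper are illustrative choices, not part of the original activity.

```r
set.seed(1)                      # arbitrary seed, for reproducibility only
values   <- c(2, 2, 2, 4, 6, 12) # the six values listed in the problem
n_rounds <- 20

# One die roll per round picks which of the six values is observed
roll   <- sample(1:6, n_rounds, replace = TRUE)
actual <- values[roll]

# Fixed-guess procedures: always guess the mode, the median, or the mean
guesses <- c(mode = 2, median = median(values), mean = mean(values))

# Summarize one guessing procedure against the observed values
score <- function(guess) {
  diff <- actual - guess
  c(match_rate = mean(actual == guess),  # share of exact hits
    mean_diff  = mean(diff),             # average of Actual - Guess
    sse        = sum(diff^2))            # sum of squared differences
}

round(sapply(guesses, score), 2)

# Random-guess procedure: a fresh random pick from the values each round
round(score(sample(values, n_rounds, replace = TRUE)), 2)
```

Each column of the resulting table answers "how good was the procedure?" in a different sense: exact matches favor the mode, while the difference-based rows favor guesses near the center of the values.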

Simulation of the Toy Problem

 min Q1 median  Q3 max
   2  2      3 5.5  11
 mean       sd n missing
  4.5 3.563706 6       0

Simulation Results

Mean + Residual as Simple Model

  • Mean:
    • Balances Residuals: \(\sum e_i=\sum (y_i-\bar{y}) = 0\)
    • Minimizes \(\sum e_i^2=\sum (y_i-\bar{y})^2\)
  • R functions:
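
A minimal sketch of how lm(), predict(), and resid() fit together for the mean-only model. The data frame below is hypothetical (NOT the class data). The formula `height ~ 1` asks for an intercept only; R also accepts `height ~ NULL` for the same model.

```r
# Hypothetical data frame (NOT the class data)
students <- data.frame(height = c(61, 63, 64, 65, 65, 66, 67, 68, 70, 74))

# Fit the mean-only model: one parameter, the intercept
empty_model <- lm(height ~ 1, data = students)

b0 <- coef(empty_model)[["(Intercept)"]]    # equals mean(students$height)
students$predicted <- predict(empty_model)  # the same value, b0, for every row
students$residual  <- resid(empty_model)    # height - predicted

sum(students$residual)     # zero, up to floating-point rounding
sum(students$residual^2)   # sum of squared residuals around the mean

# Sum of squared deviations around the median, for comparison
sum((students$height - median(students$height))^2)
```

The last two lines illustrate the minimizing property: the sum of squares around the mean is never larger than the sum of squares around any other single number, including the median.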

Models and Explaining Variability

  • Mean-Model is the Null Hypothesis
    • All variation is due to chance
  • The Alternative Hypothesis:
    • Added variables can explain some variability
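
One way to see "explaining variability" is to compare the squared error of the mean-only model with that of a model that adds a variable. This sketch uses mtcars, a data set that ships with base R, purely as a stand-in; the choice of mpg and wt is an assumption for illustration and is unrelated to the height data.

```r
# mtcars ships with base R; it stands in here for any data set
empty_model <- lm(mpg ~ 1,  data = mtcars)   # null: mean only
wt_model    <- lm(mpg ~ wt, data = mtcars)   # alternative: adds weight

# Sum of squared residuals under each model
sse_empty <- sum(resid(empty_model)^2)
sse_wt    <- sum(resid(wt_model)^2)

# Proportion of the empty model's squared error removed by adding wt
(sse_empty - sse_wt) / sse_empty
```

If the added variable explained nothing, the two sums of squares would be nearly equal and the proportion would be close to zero.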

Statistical Modeling Notation

Modeling Observed Data by Mean

  • Data = Observed Mean + Residual \[y_i = b_0 + e_i\]
    • \(b_0 = \bar y\) and
    • \(e_i = y_i - \bar y\)

Modeling (Population) DGP by Mean

  • Data = Population Mean + Error \[y_i = \beta_0 + \varepsilon_i\]
    • \(\beta_0 = \mu\) and
    • \(\varepsilon_i = y_i - \mu\)
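
The difference between the two notations can be illustrated by simulating a DGP whose population mean is known. This is a sketch under stated assumptions: the values of mu and sigma are hypothetical, and a normal distribution is assumed only for convenience.

```r
set.seed(5)      # arbitrary seed, for reproducibility only
mu    <- 66      # assumed population mean (hypothetical)
sigma <- 3.5     # assumed population SD (hypothetical)

y <- rnorm(157, mean = mu, sd = sigma)   # one simulated sample

b0  <- mean(y)   # b_0: estimate of beta_0 computed from the sample
e   <- y - b0    # residuals: deviations from the sample mean
eps <- y - mu    # errors: deviations from the population mean

c(b0 = b0, sum_e = sum(e), sum_eps = sum(eps))
```

The residuals always sum to zero because \(b_0\) is computed from the same sample, while the errors generally do not, since \(\beta_0 = \mu\) is fixed by the population rather than by the sample.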