CourseKata Chapter 9 (9 to 11)

Linear Regression

Mansour Abdoli, PhD

Overview / Goals

  • Correlation (r)
  • Correlation matrices
  • Relationship between r and R²
  • Null distribution of slopes
  • Limits of regression models

Two Numerical Variables

Previously we modeled:

  • response variable \(y\)
  • explanatory variable \(x\)

with regression

\[ \hat{y} = b_0 + b_1x \]

Key question:

How strong is the relationship?

Pearson’s Correlation (r)

Pearson’s \(r\):

  • measures strength and direction
  • of a linear relationship
  • between two quantitative variables

Range:

\[ -1 \le r \le 1 \]

Interpreting r

r value       Meaning
close to 1    strong positive relationship
close to -1   strong negative relationship
close to 0    weak or no linear relationship

Important:

Correlation measures only linear association. A strongly curved relationship can still have r close to 0.

Example Scatterplot

Questions:

  • What direction is the relationship?
  • How strong is the relationship?

Calculating Correlation

Pearson’s r (using R):

cor(mtcars$wt, mtcars$mpg)
[1] -0.8676594

Interpretation:

  • sign → direction
  • magnitude → strength
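
To show what cor() computes, here is a minimal base-R sketch that builds r from its usual definition (the sum of products of deviations from the means, scaled by the spread of each variable) and compares it with the built-in result. The names x, y, and r_manual are introduced here for illustration only.

```r
x <- mtcars$wt
y <- mtcars$mpg

# Deviations of each observation from its variable's mean
dev_x <- x - mean(x)
dev_y <- y - mean(y)

# Pearson's r: sum of cross-products of deviations, divided by the
# square root of the product of the two sums of squared deviations
r_manual <- sum(dev_x * dev_y) / sqrt(sum(dev_x^2) * sum(dev_y^2))

r_manual
all.equal(r_manual, cor(x, y))  # TRUE when the two calculations agree
```

The negative sign comes from the numerator: heavier-than-average cars tend to have lower-than-average mpg, so most of the cross-products are negative.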

Correlation Matrix

Pairwise correlations for multiple variables:

Example:

round(cor(mtcars[,c("mpg","wt","hp","disp")]),2)
       mpg    wt    hp  disp
mpg   1.00 -0.87 -0.78 -0.85
wt   -0.87  1.00  0.66  0.89
hp   -0.78  0.66  1.00  0.79
disp -0.85  0.89  0.79  1.00
  • Practice: Interpret some of the values.

Relationship Between r and R²

For simple linear regression:

\[ R^2 = r^2 \]

where:

  • \(r\) = correlation
  • \(R^2\) = proportion of variation explained

Example:

\(r^2\):

cor(mtcars$wt, mtcars$mpg)^2
[1] 0.7528328

PRE (proportional reduction in error, which equals \(R^2\) here):

pre(lm(mpg ~ wt, data = mtcars))
[1] 0.7528328
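
pre() comes from the CourseKata packages. As a cross-check that does not depend on it, the sketch below gets the same quantity from base R in two ways: from summary() and from the sums of squares. The names model, sse, sst, and pre_manual are illustrative assumptions.

```r
model <- lm(mpg ~ wt, data = mtcars)

# Squared correlation between the two variables
r_sq <- cor(mtcars$wt, mtcars$mpg)^2

# R-squared as reported by base R's model summary
r_sq_summary <- summary(model)$r.squared

# Proportional reduction in error computed by hand:
# error left over from the regression model (SSE) compared with
# error from the mean-only (empty) model (SST)
sse <- sum(residuals(model)^2)
sst <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
pre_manual <- 1 - sse / sst

c(r_sq, r_sq_summary, pre_manual)  # all three should match
```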

Understanding the Slope

Recall the regression model:

\[ \hat{y} = b_0 + b_1x \]

The slope \(b_1\):

change in predicted \(y\) for a one-unit increase in \(x\).
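
As an illustration with the mtcars model used earlier (mpg predicted from wt, where wt is recorded in units of 1000 lbs), the estimates can be read off with base R's coef(). The names intercept, slope, pred_3, and pred_4 are illustrative.

```r
model <- lm(mpg ~ wt, data = mtcars)

# Extract the two estimates by name
estimates <- coef(model)
intercept <- estimates[["(Intercept)"]]
slope     <- estimates[["wt"]]
slope  # about -5.34 mpg per additional 1000 lbs

# Predicted mpg for cars weighing 3000 lbs and 4000 lbs
pred_3 <- intercept + slope * 3
pred_4 <- intercept + slope * 4

# A one-unit increase in wt changes the prediction by exactly the slope
pred_4 - pred_3
```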

But how do we know if this slope is meaningful?

Null Model Idea

  • The slope is zero: \(\beta_1=0\)

  • No relationship between \(x\) and \(y\)

  • Under this model, any relationship observed in the sample is due to random variation alone.

  • Simulate random variation:
    1. Shuffle the response variable.
    2. Recalculate the slope.
    3. Repeat many times to build the null distribution of slopes.

Simulation Example
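
The shuffle-and-refit steps can be sketched in base R as follows. This is one possible implementation under stated assumptions, not necessarily the code used in the course: the seed, the 1000 repetitions, and the names observed_slope, null_slopes, and prop_as_extreme are all choices made for illustration.

```r
set.seed(1)  # arbitrary seed so the shuffles are reproducible

# Slope from the real data
observed_slope <- coef(lm(mpg ~ wt, data = mtcars))[["wt"]]

# Build the null distribution: shuffle mpg, refit, keep the slope
n_shuffles <- 1000
null_slopes <- replicate(n_shuffles, {
  shuffled <- mtcars
  shuffled$mpg <- sample(shuffled$mpg)  # breaks any link between wt and mpg
  coef(lm(mpg ~ wt, data = shuffled))[["wt"]]
})

# Where does the observed slope fall among the shuffled slopes?
hist(null_slopes,
     xlim = range(c(null_slopes, observed_slope)),
     main = "Null distribution of slopes",
     xlab = "Slope from shuffled data")
abline(v = observed_slope, lwd = 2)

# Proportion of shuffled slopes at least as far from 0 as the observed one
prop_as_extreme <- mean(abs(null_slopes) >= abs(observed_slope))
prop_as_extreme
```

The shuffled slopes centre near zero because shuffling removes any real relationship. A small proportion means a slope as large as the observed one would rarely arise from random variation alone.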

Limitations of Regression

  1. Association vs causation
  2. Extrapolation
  3. Linearity assumption

Association vs Causation

Regression detects association, not causation.

Example:

Ice cream sales ↑
Drowning incidents ↑

  • Does ice cream cause drowning?
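
One common explanation is a third variable, such as hot weather, that drives both. The sketch below uses made-up simulated data to show how this can happen. Every number in it is a hypothetical assumption, not real sales or drowning figures, and the names temperature, ice_cream, and drownings are illustrative.

```r
set.seed(42)  # arbitrary seed for reproducibility
n <- 200

# Hypothetical daily temperature, which influences both outcomes
temperature <- runif(n, min = 10, max = 35)

# Both variables depend on temperature, but not on each other
ice_cream <- 20 + 3.0 * temperature + rnorm(n, sd = 10)
drownings <- 1 + 0.2 * temperature + rnorm(n, sd = 1)

# The two outcomes are clearly correlated...
r_raw <- cor(ice_cream, drownings)
r_raw

# ...but once temperature is removed from each (via residuals),
# little association remains
r_adjusted <- cor(resid(lm(ice_cream ~ temperature)),
                  resid(lm(drownings ~ temperature)))
r_adjusted
```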

Extrapolation

Where are models reliable?

  • Within the observed data range. Predictions outside that range (extrapolation) are not supported by the data.

Example:
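
A minimal sketch with the mtcars model from earlier slides: car weights in the data run from about 1.5 to 5.4 (in 1000 lbs), so a prediction at wt = 10 lies far outside the observed range. The names new_cars and predictions are illustrative.

```r
model <- lm(mpg ~ wt, data = mtcars)

# Range of weights the model was actually fitted on
range(mtcars$wt)

# One weight inside the observed range and one far outside it
new_cars <- data.frame(wt = c(3, 10))
predictions <- predict(model, newdata = new_cars)
predictions

# The extrapolated prediction is negative, which is impossible for mpg
predictions[2] < 0
```

The model has no way to signal that wt = 10 is unreasonable. It simply extends the same straight line.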

Linearity Assumption

Linear regression assumes:

\[\text{relationship = straight line}\]

But some relationships are curved.

Example patterns:

  • quadratic
  • exponential
  • saturation effects

Example Nonlinear Pattern

A straight line would not describe this well.
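
A small constructed example, assuming a perfectly quadratic relationship on x values symmetric around zero (the data are invented for illustration, and x, y, and line_fit are illustrative names):

```r
# x is symmetric around 0 and y depends on x exactly, with no noise
x <- seq(-3, 3, length.out = 61)
y <- x^2

# The relationship is perfect but not linear, so r is essentially 0
r_curved <- cor(x, y)
r_curved

# The best-fitting straight line is nearly flat and misses the curve
line_fit <- lm(y ~ x)
plot(x, y, main = "Quadratic pattern with fitted line")
abline(line_fit)
```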

Diagnosing Linearity

We check:

  • scatterplots
    • points should follow a roughly straight-line trend
  • residual plots
    • residuals should show no systematic pattern
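
Both checks can be sketched in base R for the mtcars model used earlier. The plotting choices and the name model are illustrative.

```r
model <- lm(mpg ~ wt, data = mtcars)

# Check 1: scatterplot with the fitted line
plot(mtcars$wt, mtcars$mpg,
     xlab = "wt (1000 lbs)", ylab = "mpg",
     main = "Scatterplot with regression line")
abline(model)

# Check 2: residuals against fitted values
# A curve or funnel shape here would suggest the line is a poor description
plot(fitted(model), resid(model),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residual plot")
abline(h = 0, lty = 2)  # residuals should scatter evenly around this line
```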

Key Takeaways

  • Correlation (r): measures strength and direction of linear association
  • Relationship: \(R^2 = r^2\)
  • Null slope distribution:
    • helps evaluate whether an observed slope could plausibly have arisen by chance alone
  • Regression limitations:
    • association ≠ causation
    • avoid extrapolation
    • check linearity