3.3 — Omitted Variable Bias

ECON 480 • Econometrics • Fall 2020

Ryan Safner
Assistant Professor of Economics
safner@hood.edu
ryansafner/metricsF20
metricsF20.classes.ryansafner.com

Review: u

$u_i$: Error term, includes all other variables that affect $Y$
Every regression model always has omitted variables assumed in the error
- Most unobservable (hence "u") or hard to measure
- Examples: innate ability, weather at the time, etc
Again, we assume $u$ is random, with $E[u|X]=0$ and $var(u)=\sigma^2_u$
Sometimes, omission of variables can bias OLS estimators $\hat{\beta}_0$ and $\hat{\beta}_1$

Omitted Variable Bias I

Omitted variable bias (OVB) for some omitted variable $Z$ exists if two conditions are met:

1. $Z$ is a determinant of $Y$

i.e. $Z$ is in the error term, $u_i$

2. $Z$ is correlated with the regressor $X$

i.e. $cor(X,Z) \neq 0$
implies $cor(X,u) \neq 0$
implies X is endogenous

Omitted Variable Bias II

Omitted variable bias makes $X$ endogenous: $E(u_i|X_i) \neq 0$
- Knowing $X_i$ tells you something about $u_i$
- Knowing $X$ tells you something about $Y$ not by way of $X$!

Omitted Variable Bias III

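$\hat{\beta}_1$ is biased: $E[\hat{\beta}_1] \neq \beta_1$
$\hat{\beta}_1$ systematically over- or under-estimates the true relationship ($\beta_1$)
$\hat{\beta}_1$ “picks up” both:
- $X \rightarrow Y$
- $X \leftarrow Z \rightarrow Y$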

Omitted Variable Bias: Class Size Example

Example: Consider our recurring class size and test score example:

$$\text{Test score}_i = \beta_0 + \beta_1 \text{STR}_i + u_i$$

Which of the following possible variables would cause a bias if omitted?

$Z_i$: time of day of the test
$Z_i$: parking space per student
$Z_i$: percent of ESL students

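Recall: Endogeneity and Bias

The true expected value of $\hat{\beta}_1$ is actually:†

$$E[\hat{\beta}_1] = \beta_1 + cor(X,u)\frac{\sigma_u}{\sigma_X}$$

1) If $X$ is exogenous: $cor(X,u)=0$, we're just left with $\beta_1$

2) The larger $cor(X,u)$ is, the larger the bias: $\left(E[\hat{\beta}_1]-\beta_1\right)$

3) We can “sign” the direction of the bias based on $cor(X,u)$

Positive $cor(X,u)$ overestimates the true $\beta_1$ ($\hat{\beta}_1$ is too high)
Negative $cor(X,u)$ underestimates the true $\beta_1$ ($\hat{\beta}_1$ is too low)

† See 2.4 class notes for proof.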
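
To see the bias expression $\beta_1 + cor(X,u)\frac{\sigma_u}{\sigma_X}$ at work, here is a minimal simulation sketch. The data are made up: the sample size, the seed, the "true" parameters, and the names `z`, `x`, `y`, `u` are all illustrative assumptions, not anything from the class data. With sample correlations and standard deviations, the expression matches the OLS slope exactly.

```r
# Simulated data (every number here is an illustrative assumption)
set.seed(480)
n  <- 1000
z  <- rnorm(n)                      # omitted variable Z
x  <- 0.5 * z + rnorm(n)            # regressor X, correlated with Z
b0 <- 2; b1 <- 1; b2 <- -3          # "true" parameters, chosen arbitrarily
y  <- b0 + b1 * x + b2 * z + rnorm(n)

# Error term of the short model Y = b0 + b1*X + u (it absorbs Z)
u <- y - b0 - b1 * x

# OLS slope from the short regression that omits Z
b1_hat <- unname(coef(lm(y ~ x))["x"])

# Bias term: cor(X,u) * sd(u) / sd(X), using sample quantities
bias <- cor(x, u) * sd(u) / sd(x)

# The two entries agree up to floating-point error
c(estimate = b1_hat, truth_plus_bias = b1 + bias)
```

Because `z` enters `y` with a negative coefficient and is positively correlated with `x`, `cor(x, u)` is negative here, so the short regression underestimates the true slope.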

Endogeneity and Bias: Correlations I

Here is where checking correlations between variables helps:

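# Load the tidyverse for %>% and select() (CASchool is assumed to be loaded already)
library(tidyverse)
# Select only the three variables we want (there are many)
CAcorr <- CASchool %>%
  select(str, testscr, el_pct)
# Make a correlation table
corr <- cor(CAcorr)
corr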

##                str    testscr     el_pct
## str      1.0000000 -0.2263628  0.1876424
## testscr -0.2263628  1.0000000 -0.6441237
## el_pct   0.1876424 -0.6441237  1.0000000

el_pct is strongly (negatively) correlated with testscr (Condition 1)
el_pct is reasonably (positively) correlated with str (Condition 2)

Endogeneity and Bias: Correlations II

Here is where checking correlations between variables helps:

# Make a correlation plot
library(corrplot)
corrplot(corr, type="upper", 
         method = "number", # number for showing correlation coefficient
         order="original")

el_pct is strongly correlated with testscr (Condition 1)
el_pct is reasonably correlated with str (Condition 2)

Look at Conditional Distributions I

# make a new variable called ESL
# = "High ESL" (if el_pct is above median) or = "Low ESL" (if below median)
CASchool <- CASchool %>% # next we create a new dummy variable called ESL
  mutate(ESL = ifelse(el_pct > median(el_pct), # test if el_pct is above median
                      yes = "High ESL", # if yes, set ESL to "High ESL"
                      no = "Low ESL")) # if no, set ESL to "Low ESL"
# get average test score by high/low ESL
CASchool %>%
  group_by(ESL) %>%
  summarize(Average_test_score = mean(testscr))

## ESL       Average_test_score
## <chr>                  <dbl>
## High ESL            643.9591
## Low ESL             664.3540

Look at Conditional Distributions II

ggplot(data = CASchool)+
  aes(x = testscr,
      fill = ESL)+
  geom_density(alpha=0.5)+
  labs(x = "Test Score",
       y = "Density")+
  ggthemes::theme_pander(
    base_family = "Fira Sans Condensed",
    base_size=20
    )+
  theme(legend.position = "bottom")

Look at Conditional Distributions III

esl_scatter<-ggplot(data = CASchool)+
  aes(x = str,
      y = testscr,
      color = ESL)+
  geom_point()+
  geom_smooth(method="lm")+
  labs(x = "STR",
       y = "Test Score")+
  ggthemes::theme_pander(
    base_family = "Fira Sans Condensed",
    base_size=20
    )+
  theme(legend.position = "bottom")
esl_scatter

Look at Conditional Distributions III

esl_scatter+
  facet_grid(~ESL)+
  guides(color = "none") # hide the redundant color legend

Omitted Variable Bias in the Class Size Example

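$cor(STR, \%EL)$ is positive (0.19)
$cor(\%EL, \text{Test score})$ is negative (-0.64), so %EL enters $u$ with a negative sign
Hence $cor(STR, u)$ is negative (via %EL), and the bias is negative
$\beta_1$ is negative (between Test score and STR)
- Since $\beta_1$ is already negative, the bias makes $\hat{\beta}_1$ a larger negative number than it truly is
- Implies that $\hat{\beta}_1$ overstates the effect of reducing STR on improving Test Scores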

Omitted Variable Bias: Messing with Causality I

If school districts with higher Test Scores happen to have lower STR, AND districts with lower STR tend to have lower %EL ...

How can we say $\hat{\beta}_1$ estimates the marginal effect of STR on Test Score?

Omitted Variable Bias: Messing with Causality II

Consider an ideal randomized controlled trial (RCT)
Randomly assign experimental units (e.g. people, cities, etc) into two (or more) groups:
- Treatment group(s): get a (certain type or level of) treatment
- Control group(s): get no treatment
Compare results of the two groups to get the average treatment effect

RCTs Neutralize Omitted Variable Bias I

Example: Imagine an ideal RCT for measuring the effect of STR on Test Score

School districts would be randomly assigned a student-teacher ratio
With random assignment, all factors in $u$ (family size, parental income, years in the district, day of the week of the test, climate, etc) are distributed independently of class size

RCTs Neutralize Omitted Variable Bias II

Example: Imagine an ideal RCT for measuring the effect of STR on Test Score

Thus, $cor(STR, u)=0$ and $E[u|STR]=0$, i.e. exogeneity
Our $\hat{\beta}_1$ would be an unbiased estimate of $\beta_1$, measuring the true causal effect of STR on Test Score
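
A small simulation can illustrate why random assignment helps. This is only a sketch on made-up data: the variable names, the seed, and the "true" effect `b1 = 1` are assumptions for illustration. The same outcome equation is used twice, once where the regressor depends on an unobserved factor `z` (observational) and once where it is drawn independently of `z` (as in an ideal RCT).

```r
set.seed(480)
n <- 10000
z <- rnorm(n)                        # unobserved factor sitting in u
b1 <- 1; b2 <- -3                    # arbitrary "true" parameters

# Observational data: X depends on Z
x_obs <- 0.5 * z + rnorm(n)
# Ideal RCT: X is assigned at random, independently of Z
x_rct <- rnorm(n)

y_obs <- 2 + b1 * x_obs + b2 * z + rnorm(n)
y_rct <- 2 + b1 * x_rct + b2 * z + rnorm(n)

# Correlation between the regressor and the omitted factor
cor_obs <- cor(x_obs, z)
cor_rct <- cor(x_rct, z)

# Slopes from regressions that both omit Z
b1_obs <- unname(coef(lm(y_obs ~ x_obs))[2])
b1_rct <- unname(coef(lm(y_rct ~ x_rct))[2])

round(c(cor_obs = cor_obs, cor_rct = cor_rct,
        b1_obs = b1_obs, b1_rct = b1_rct), 3)
```

Under random assignment the correlation with `z` should be close to zero and the slope close to `b1`, even though `z` is still omitted from the regression.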

But We Rarely, if Ever, Have RCTs

But our data is not an RCT, it is observational data!
“Treatment” of having a large or small class size is NOT randomly assigned!
%EL: plausibly fits criteria of O.V. bias!
1. %EL is a determinant of Test Score
2. %EL is correlated with STR
Thus, “control” group and “treatment” group differ systematically!
- Districts with small STR also tend to have lower %EL; districts with large STR also tend to have higher %EL
- Selection bias: $cor(STR, \%EL) \neq 0$, $E[u_i|STR_i] \neq 0$

[Figure: Treatment Group vs. Control Group]

Another Way to Control for Variables

Causal pathways connecting str and test score:
- str → test score
- str ← ESL → test score
DAG rules tell us we need to control for ESL in order to identify the causal effect of str → test score
So now, how do we control for a variable?

Controlling for Variables

Look at effect of STR on Test Score by comparing districts with the same %EL.
- Eliminates differences in %EL between high and low STR classes
- “As if” we had a control group! Hold %EL constant
The simple fix is just to not omit %EL!
- Make it another independent variable on the right-hand side of the regression
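
As a rough sketch of what "not omitting" a variable buys us, the simulation below compares a regression that leaves out a confounder with one that includes it. Everything here is assumed for illustration: the data frame `sim`, its columns (`str` and `el` are simulated stand-ins, not the real CASchool variables), and the "true" effect of -1.

```r
set.seed(480)
n  <- 10000
el <- rnorm(n)                                  # stand-in for %EL
sim <- data.frame(el  = el,
                  str = 0.5 * el + rnorm(n))    # stand-in for STR, correlated with %EL
# "True" model: effect of str is -1, effect of el is -3 (arbitrary choices)
sim$score <- 2 - 1 * sim$str - 3 * sim$el + rnorm(n)

short <- lm(score ~ str, data = sim)            # omits el
long  <- lm(score ~ str + el, data = sim)       # controls for el

b_short <- unname(coef(short)["str"])
b_long  <- unname(coef(long)["str"])

round(c(short = b_short, long = b_long), 3)
```

The coefficient on `str` from the regression that includes `el` should land near the assumed true value of -1, while the one that omits `el` is pushed further below it.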


The Multivariate Regression Model

Multivariate Econometric Models Overview

$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + u_i$$

$Y$ is the dependent variable of interest
- AKA "response variable," "regressand," "Left-hand side (LHS) variable"

$X_1$ and $X_2$ are independent variables
- AKA "explanatory variables", "regressors," "Right-hand side (RHS) variables", "covariates"

Our data consists of a spreadsheet of observed values of $(X_{1i}, X_{2i}, Y_i)$

To model, we "regress $Y$ on $X_1$ and $X_2$"

$\beta_0, \beta_1, \cdots, \beta_k$ are parameters that describe the population relationships between the variables
- With $k$ regressors, we estimate $k+1$ parameters (“betas”)†

† Note Bailey defines k to include both the number of variables plus the constant.

Marginal Effects I

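$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i}$$

Consider changing $X_1$ by $\Delta X_1$ while holding $X_2$ constant:

$$\begin{aligned}
Y &= \beta_0 + \beta_1 X_1 + \beta_2 X_2 && \text{Before the change} \\
Y + \Delta Y &= \beta_0 + \beta_1 (X_1 + \Delta X_1) + \beta_2 X_2 && \text{After the change} \\
\Delta Y &= \beta_1 \Delta X_1 && \text{The difference} \\
\frac{\Delta Y}{\Delta X_1} &= \beta_1 && \text{Solving for } \beta_1
\end{aligned}$$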

Marginal Effects II

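Similarly, for $\beta_2$:

$$\beta_2 = \frac{\Delta Y}{\Delta X_2} \text{ holding } X_1 \text{ constant}$$

And for the constant, $\beta_0$:

$$\beta_0 = \text{predicted value of } Y \text{ when } X_1 = 0, \; X_2 = 0$$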
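
These interpretations can be checked numerically on any fitted two-regressor model. The sketch below uses R's built-in `mtcars` data purely as a stand-in (it is not the class dataset), with `wt` playing the role of $X_1$ and `hp` the role of $X_2$; the particular values 3, 4, and 110 are arbitrary.

```r
# mtcars ships with base R; it is only a stand-in dataset here
reg <- lm(mpg ~ wt + hp, data = mtcars)

before <- data.frame(wt = 3, hp = 110)   # X1 = 3, X2 = 110
after  <- data.frame(wt = 4, hp = 110)   # X1 rises by 1, X2 held constant

# Change in predicted Y from a one-unit change in X1
delta_y <- unname(predict(reg, after) - predict(reg, before))
beta_1  <- unname(coef(reg)["wt"])

# Predicted Y when both regressors are zero
at_zero <- unname(predict(reg, data.frame(wt = 0, hp = 0)))
beta_0  <- unname(coef(reg)["(Intercept)"])

c(delta_y = delta_y, beta_1 = beta_1, at_zero = at_zero, beta_0 = beta_0)
```

The change in the prediction equals the coefficient on `wt`, and the prediction at zero equals the constant, which is just the algebra above applied to estimated coefficients.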

You Can Keep Your Intuitions...But They're Wrong Now

We have been envisioning OLS regressions as the equation of a line through a scatterplot of data on two variables, $X$ and $Y$
- $\beta_0$: "intercept"
- $\beta_1$: "slope"
With 3+ variables, OLS regression is no longer a "line" for us to estimate

The "Constant"

Alternatively, we can write the population regression equation as:
$$Y_i = \beta_0 X_{0i} + \beta_1 X_{1i} + \beta_2 X_{2i} + u_i$$
Here, we added $X_{0i}$ to $\beta_0$
$X_{0i}$ is a constant regressor, as we define $X_{0i} = 1$ for all $i$ observations
Likewise, $\beta_0$ is more generally called the “constant” term in the regression (instead of the “intercept”)
This may seem silly and trivial, but this will be useful next class!
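
One way to convince yourself that the constant is just the coefficient on a regressor that always equals 1: build that column by hand and tell `lm()` not to add its own intercept. This is a sketch on the built-in `mtcars` data (a stand-in, not the class data); the column name `x0` is an assumption made for the example.

```r
# Add a constant regressor X0 = 1 for every observation
dat <- transform(mtcars, x0 = 1)

# Usual specification: lm() adds the intercept automatically
with_intercept <- lm(mpg ~ wt + hp, data = dat)

# Same model with the constant written out; "0 +" suppresses lm()'s intercept
explicit_const <- lm(mpg ~ 0 + x0 + wt + hp, data = dat)

coef(with_intercept)
coef(explicit_const)
```

The two sets of estimates match; only the label on the first coefficient changes, from `(Intercept)` to `x0`.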

The Population Regression Model: Example I

Example:

Let's see what you remember from micro(econ)!
What measures the price effect? What sign should it have?
What measures the income effect? What sign should it have? What should inferior or normal (necessities & luxury) goods look like?
What measures the cross-price effect(s)? What sign should substitutes and complements have?

The Population Regression Model: Example I

Example:

Interpret each coefficient

Multivariate OLS in R

# run regression of testscr on str and el_pct
school_reg_2 <- lm(testscr ~ str + el_pct, 
                 data = CASchool)

Format for regression is lm(y ~ x1 + x2, data = df)
y is the dependent variable (listed first!)
~ means “modeled by”
x1 and x2 are the independent variables
df is the dataframe where the data is stored

   

Multivariate OLS in R II

# look at reg object
school_reg_2

## 
## Call:
## lm(formula = testscr ~ str + el_pct, data = CASchool)
## 
## Coefficients:
## (Intercept)          str       el_pct  
##    686.0322      -1.1013      -0.6498

Stored as an lm object called school_reg_2, a list object
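
Since an lm object is a list, its pieces can be pulled out by name or with extractor functions. A brief sketch, again using the built-in `mtcars` data as a stand-in for the class data:

```r
# Fit a two-regressor model on a stand-in dataset
reg <- lm(mpg ~ wt + hp, data = mtcars)

class(reg)      # the object's class
is.list(reg)    # lm objects are built on top of lists
names(reg)      # components stored inside the object

# Two equivalent ways to get the estimated coefficients
coef(reg)
reg$coefficients

# Other common extractors
head(residuals(reg))   # first few residuals
nobs(reg)              # number of observations used
```

The extractor functions (`coef()`, `residuals()`, `nobs()`) are generally the safer habit, since they keep working across different model types.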

   

Multivariate OLS in R III

summary(school_reg_2) # get full summary

## 
## Call:
## lm(formula = testscr ~ str + el_pct, data = CASchool)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.845 -10.240  -0.308   9.815  43.461 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 686.03225    7.41131  92.566  < 2e-16 ***
## str          -1.10130    0.38028  -2.896  0.00398 ** 
## el_pct       -0.64978    0.03934 -16.516  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.46 on 417 degrees of freedom
## Multiple R-squared:  0.4264,    Adjusted R-squared:  0.4237 
## F-statistic:   155 on 2 and 417 DF,  p-value: < 2.2e-16
   

Multivariate OLS in R IV: broom

# load packages
library(broom)
# tidy regression output
tidy(school_reg_2)

## term           estimate  std.error
## <chr>             <dbl>      <dbl>
## (Intercept) 686.0322487 7.41131248
## str          -1.1012959 0.38027832
## el_pct       -0.6497768 0.03934255

Multivariate Regression Output Table

library(huxtable)
huxreg("Model 1" = school_reg,
       "Model 2" = school_reg_2,
       coefs = c("Intercept" = "(Intercept)",
                 "Class Size" = "str",
                 "%ESL Students" = "el_pct"),
       statistics = c("N" = "nobs",
                      "R-Squared" = "r.squared",
                      "SER" = "sigma"),
       number_format = 2)

                  Model 1       Model 2
Intercept         698.93 ***    686.03 ***
                  (9.47)        (7.41)
Class Size        -2.28 ***     -1.10 **
                  (0.48)        (0.38)
%ESL Students                   -0.65 ***
                                (0.04)
N                 420           420
R-Squared         0.05          0.43
SER               18.58         14.46
*** p < 0.001; ** p < 0.01; * p < 0.05.
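
The two class-size coefficients in the table give a quick back-of-the-envelope read on how much the estimate moves once %ESL is included. The sketch below only redoes that arithmetic with the rounded numbers from the table; treating Model 2 as the benchmark is an assumption, since Model 2 may itself still omit relevant variables.

```r
# Coefficients on class size (str), copied from the output table
b1_model1 <- -2.28   # Model 1: el_pct omitted
b1_model2 <- -1.10   # Model 2: el_pct included

# Change in the estimate once el_pct is controlled for
gap <- b1_model1 - b1_model2

# Size of Model 1's estimate relative to Model 2's
ratio <- b1_model1 / b1_model2

c(gap = gap, ratio = round(ratio, 2))
```

A negative gap means the estimate from the regression without %ESL is further below zero than the estimate that controls for it.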

