class: center, middle, inverse, title-slide

# 3.3 — Omitted Variable Bias
## ECON 480 • Econometrics • Fall 2020
### Ryan Safner
Assistant Professor of Economics
---

# Review: u

`$$Y_i=\beta_0+\beta_1X_i+u_i$$`

- Error term, `\(u_i\)` includes .hi-purple[all other variables that affect `\\(Y\\)`]

- Every regression model always has .hi[omitted variables] assumed in the error
  - Most unobservable (hence "*u*") or hard to measure
  - .green[**Examples**:] innate ability, weather at the time, etc

- Again, we *assume* `\(u\)` is random, with `\(E[u|X]=0\)` and `\(var(u)=\sigma^2_u\)`

- *Sometimes*, omission of variables can **bias** OLS estimators `\((\hat{\beta_0}\)` and `\(\hat{\beta_1})\)`

---

# Omitted Variable Bias I

.pull-left[

- .hi[Omitted variable bias (OVB)] for some omitted variable `\(\mathbf{Z}\)` exists if two conditions are met:

**1. `\(Z\)` is a determinant of `\(Y\)`**

- i.e. `\(Z\)` is in the error term, `\(u_i\)`

]

.pull-right[

]

---

# Omitted Variable Bias I

.pull-left[

- .hi[Omitted variable bias (OVB)] for some omitted variable `\(\mathbf{Z}\)` exists if two conditions are met:

]

.pull-right[
<!-- -->
]

---

# Omitted Variable Bias I

.pull-left[

- .hi[Omitted variable bias (OVB)] for some omitted variable `\(\mathbf{Z}\)` exists if two conditions are met:

**1. `\(Z\)` is a determinant of `\(Y\)`**

- i.e. `\(Z\)` is in the error term, `\(u_i\)`

**2. `\(Z\)` is correlated with the regressor `\(X\)`**

- i.e. `\(cor(X,Z) \neq 0\)`
- implies `\(cor(X,u) \neq 0\)`
- implies .hi-purple[X is endogenous]

]

.pull-right[
<!-- -->
]

---

# Omitted Variable Bias II

.pull-left[

- Omitted variable bias makes `\(X\)` .hi-purple[endogenous]

- `\(E(u_i|X_i)\neq 0 \implies\)` knowing `\(X\)` tells you something about `\(u_i\)`

- Knowing `\(X\)` tells you something about `\(Y\)` *not* by way of `\(X\)`!
]

.pull-right[
<!-- -->
]

---

# Omitted Variable Bias III

.pull-left[

- `\(\hat{\beta_1}\)` is .hi-purple[biased]: `\(E[\hat{\beta_1}] \neq \beta_1\)`

- `\(\hat{\beta_1}\)` systematically over- or under-estimates the true relationship `\((\beta_1)\)`

- `\(\hat{\beta_1}\)` “picks up” *both*:
  - `\(X\rightarrow Y\)`
  - `\(X \leftarrow Z\rightarrow Y\)`

]

.pull-right[
<!-- -->
]

---

# Omitted Variable Bias: Class Size Example

.content-box-green[
.green[**Example**]: Consider our recurring class size and test score example:
`$$\text{Test score}_i = \beta_0 + \beta_1 \text{STR}_i + u_i$$`
]

- Which of the following possible variables would cause a bias if omitted?

--

1. `\(Z_i\)`: time of day of the test

--

2. `\(Z_i\)`: parking space per student

--

3. `\(Z_i\)`: percent of ESL students

---

# Recall: Endogeneity and Bias

.smaller[
- The true expected value of `\(\hat{\beta_1}\)` is actually:<sup>.magenta[†]</sup>

`$$E[\hat{\beta_1}]=\beta_1+cor(X,u)\frac{\sigma_u}{\sigma_X}$$`
]

--

.smallest[
1) If `\(X\)` is exogenous: `\(cor(X,u)=0\)`, we're just left with `\(\beta_1\)`
]

--

.smallest[
2) The larger `\(cor(X,u)\)` is in magnitude, the larger the .hi-purple[bias]: `\(\left(E[\hat{\beta_1}]-\beta_1 \right)\)`
]

--

.smallest[
3) We can .hi-purple[“sign”] the direction of the bias based on `\(cor(X,u)\)`
  - .hi-purple[Positive] `\(cor(X,u)\)` overestimates the true `\(\beta_1\)` `\((\hat{\beta_1}\)` is too high)
  - .hi-purple[Negative] `\(cor(X,u)\)` underestimates the true `\(\beta_1\)` `\((\hat{\beta_1}\)` is too low)
]

.footnote[.quitesmall[
<sup>.magenta[†]</sup> See [2.4 class notes](/class/2.4-class) for proof.]
]

---

# Endogeneity and Bias: Correlations I

- Here is where checking correlations between variables helps:

.pull-left[
.code50[
```r
# Select only the three variables we want (there are many)
CAcorr<-CASchool %>%
  select("str","testscr","el_pct")

# Make a correlation table
corr<-cor(CAcorr)
corr
```

```
##                str    testscr     el_pct
## str      1.0000000 -0.2263628  0.1876424
## testscr -0.2263628  1.0000000 -0.6441237
## el_pct   0.1876424 -0.6441237  1.0000000
```
]
]

.pull-right[
- `el_pct` is strongly (negatively) correlated with `testscr` (Condition 1)

- `el_pct` is reasonably (positively) correlated with `str` (Condition 2)
]

---

# Endogeneity and Bias: Correlations II

- Here is where checking correlations between variables helps:

.pull-left[
```r
# Make a correlation plot
library(corrplot)

corrplot(corr, type="upper",
         method = "number", # number for showing correlation coefficient
         order="original")
```
]

.pull-right[
<img src="3.3-slides_files/figure-html/unnamed-chunk-6-1.png" width="504" />
]

- `el_pct` is strongly correlated with `testscr` (Condition 1)

- `el_pct` is reasonably correlated with `str` (Condition 2)

---

# Look at Conditional Distributions I

.smallest[
.code50[
```r
# make a new variable called ESL
# = "High ESL" (if el_pct is above median) or = "Low ESL" (if not)
CASchool<-CASchool %>% # next we create a new dummy variable called ESL
  mutate(ESL = ifelse(el_pct > median(el_pct), # test if el_pct is above median
                     yes = "High ESL", # if yes, call this variable "High ESL"
                     no = "Low ESL")) # if no, call this variable "Low ESL"

# get average test score by high/low ESL
CASchool %>%
  group_by(ESL) %>%
  summarize(Average_test_score=mean(testscr))
```

| ESL      | Average_test_score |
|:---------|-------------------:|
| High ESL | 643.9591 |
| Low ESL  | 664.3540 |

]
]

---

# Look at Conditional Distributions II

.pull-left[
.code50[
```r
ggplot(data = CASchool)+
  aes(x = testscr,
      fill = ESL)+
  geom_density(alpha=0.5)+
  labs(x = "Test Score",
       y = "Density")+
  ggthemes::theme_pander(
    base_family = "Fira Sans Condensed",
    base_size=20
    )+
  theme(legend.position = "bottom")
```
]
]

.pull-right[
<img src="3.3-slides_files/figure-html/unnamed-chunk-8-1.png" width="504" />
]

---

# Look at Conditional Distributions III

.pull-left[
.code50[
```r
esl_scatter<-ggplot(data = CASchool)+
  aes(x = str,
      y = testscr,
      color = ESL)+
  geom_point()+
  geom_smooth(method="lm")+
  labs(x = "STR",
       y = "Test Score")+
  ggthemes::theme_pander(
    base_family = "Fira Sans Condensed",
    base_size=20
    )+
  theme(legend.position = "bottom")
esl_scatter
```
]
]

.pull-right[
<img src="3.3-slides_files/figure-html/unnamed-chunk-9-1.png" width="504" />
]

---

# Look at Conditional Distributions III

.pull-left[
```r
esl_scatter+
* facet_grid(~ESL)+
* guides(color = "none")
```
]

.pull-right[
<img src="3.3-slides_files/figure-html/unnamed-chunk-10-1.png" width="504" />
]

---

# Omitted Variable Bias in the Class Size Example

.center[
`$$E[\hat{\beta_1}]=\beta_1+bias$$`

`\(E[\hat{\beta_1}]=\)` .red[`\\(\beta_1\\)`] `\(+\)` .blue[`\\(cor(X,u)\\)`] `\(\frac{\sigma_u}{\sigma_X}\)`
]

- `\(cor(STR, \%EL)\)` is positive, and `\(\%EL\)` lowers Test score, so `\(\%EL\)` enters `\(u\)` with a negative sign

- .blue[`\\(cor(STR,u)\\)`] is therefore negative (via `\(\%EL\)`)

- .red[`\\(\beta_1\\)`] is negative (between Test score and STR)

- .blue[Bias] is negative
  - Since `\(\color{red}{\beta_1}\)` is also negative, `\(\hat{\beta_1}\)` is made to be a *larger* negative number than `\(\color{red}{\beta_1}\)` truly is
  - Implies that `\(\hat{\beta_1}\)` *over*states the effect of reducing STR on improving Test Scores

---

# Omitted Variable Bias: Messing with Causality I

If school districts with higher Test Scores happen to have lower STR **AND** districts with smaller STR
tend to have lower `\(\%EL\)` ...

--

- How can we say `\(\hat{\beta_1}\)` estimates the **marginal effect** of `\(\Delta STR \rightarrow \Delta \text{Test Score}\)`?

---

# Omitted Variable Bias: Messing with Causality II

.pull-left[

- Consider an ideal .hi-turquoise[randomized controlled trial (RCT)]

- .hi-turquoise[Randomly] assign experimental units (e.g. people, cities, etc) into two (or more) groups:
  - .hi[Treatment group(s)]: gets a (certain type or level of) treatment
  - .hi-purple[Control group(s)]: gets *no* treatment(s)

- Compare results of two groups to get .hi-slate[average treatment effect]

]

.pull-right[
.center[

]
]

---

# RCTs Neutralize Omitted Variable Bias I

.content-box-green[
.green[**Example**]: Imagine an ideal RCT for measuring the effect of STR on Test Score
]

.pull-left[

- School districts would be .hi-turquoise[randomly assigned] a student-teacher ratio

- With random assignment, all factors in `\(u\)` (family size, parental income, years in the district, day of the week of the test, climate, etc) are distributed *independently* of class size

]

.pull-right[
.center[

]
]

---

# RCTs Neutralize Omitted Variable Bias II

.content-box-green[
.green[**Example**]: Imagine an ideal RCT for measuring the effect of STR on Test Score
]

.pull-left[

- Thus, `\(cor(STR, u)=0\)` and `\(E[u|STR]=0\)`, i.e. .hi-purple[exogeneity]

- Our `\(\hat{\beta_1}\)` would be an unbiased estimate of `\(\beta_1\)`, measuring the true causal effect of STR `\(\rightarrow\)` Test Score

]

.pull-right[
.center[

]
]

---

# But We Rarely, if Ever, Have RCTs

.pull-left[
.smallest[
- But our data is *not* from an RCT: it is observational data!

- “Treatment” of having a large or small class size is **NOT** randomly assigned!

- `\(\%EL\)`: plausibly fits criteria of O.V. bias!
  1. `\(\%EL\)` is a determinant of Test Score
  2. `\(\%EL\)` is correlated with STR

- Thus, “control” group and “treatment” group differ systematically!
  - Districts with small STR also tend to have lower `\(\%EL\)`; districts with large STR also tend to have higher `\(\%EL\)`
  - .hi-orange[Selection bias]: `\(cor(STR, \%EL) \neq 0\)`, `\(E[u_i|STR_i]\neq 0\)`
]
]

.pull-right[
.pull-left[
.center[

Treatment Group
]
]

.pull-right[
.center[

Control Group
]
]
]

---

# Another Way to Control for Variables

.pull-left[

- Causal pathways connecting str and test score:
  - str `\(\rightarrow\)` test score
  - str `\(\leftarrow\)` ESL `\(\rightarrow\)` test score

]

.pull-right[
<!-- -->
]

---

# Another Way to Control for Variables

.pull-left[

- Causal pathways connecting str and test score:
  - str `\(\rightarrow\)` test score
  - str `\(\leftarrow\)` ESL `\(\rightarrow\)` test score

- DAG rules tell us we need to .hi-purple[control for ESL] in order to identify the causal effect of str on test score

- So now, .hi-turquoise[how *do* we control for a variable]?

]

.pull-right[
<!-- -->
]

---

# Controlling for Variables

.pull-left[

- Look at effect of STR on Test Score by comparing districts with the **same** %EL.
  - Eliminates differences in %EL between high and low STR classes
  - “As if” we had a control group! Hold %EL constant

- The simple fix is just to .hi-purple[not omit %EL]!
  - Make it *another* independent variable on the right-hand side of the regression

]

.pull-right[
.pull-left[
.center[

Treatment Group
]
]

.pull-right[
.center[

Control Group
]
]
]

---

# Controlling for Variables

.pull-left[

- Look at effect of STR on Test Score by comparing districts with the **same** %EL.
  - Eliminates differences in %EL between high and low STR classes
  - “As if” we had a control group! Hold %EL constant

- The simple fix is just to .hi-purple[not omit %EL]!
  - Make it *another* independent variable on the right-hand side of the regression

]

.pull-right[
<!-- -->
]

---

class: inverse, center, middle

# The Multivariate Regression Model

---

# Multivariate Econometric Models Overview

.smallest[
`$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_kX_{ki} +u_i$$`
]

--

.smallest[
- `\(Y\)` is the .hi[dependent variable] of interest
  - AKA "response variable," "regressand," "Left-hand side (LHS) variable"
]

--

.smallest[
- `\(X_1\)` and `\(X_2\)` are .hi[independent variables]
  - AKA "explanatory variables", "regressors," "Right-hand side (RHS) variables", "covariates"
]

--

.smallest[
- Our data consists of a spreadsheet of observed values of `\((X_{1i}, X_{2i}, Y_i)\)`
]

--

.smallest[
- To model, we .hi-turquoise["regress Y on `\\(X_1\\)` and `\\(X_2\\)`"]
]

--

.smallest[
- `\(\beta_0, \beta_1, \cdots, \beta_k\)` are .hi-purple[parameters] that describe the population relationships between the variables
  - We estimate `\(k+1\)` parameters (“betas”)<sup>.magenta[†]</sup>
]

.footnote[<sup>.magenta[†]</sup> Note Bailey defines k to include both the number of variables and the constant.]
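---

# Omitted Variable Bias in the Multivariate Model: A Sketch

.smallest[
A worked sketch, not part of the original derivation, of why omitting `\(X_2\)` from the model above biases `\(\hat{\beta_1}\)`. It assumes the true model has two regressors with `\(E[u|X_1,X_2]=0\)`, and describes how `\(X_2\)` moves with `\(X_1\)` using an auxiliary regression whose symbols `\(\delta_0\)`, `\(\delta_1\)`, and `\(v_i\)` are introduced here only for illustration.

`$$\begin{align*}
Y_i &= \beta_0+\beta_1 X_{1i}+\beta_2 X_{2i}+u_i && \text{True model}\\
X_{2i} &= \delta_0+\delta_1 X_{1i}+v_i && \text{Auxiliary regression (assumed)}\\
Y_i &= (\beta_0+\beta_2\delta_0)+(\beta_1+\beta_2\delta_1) X_{1i}+(\beta_2 v_i+u_i) && \text{Substituting}\\
\end{align*}$$`

- Regressing `\(Y\)` on `\(X_1\)` alone would then estimate `\(\beta_1+\beta_2\delta_1\)` rather than `\(\beta_1\)`

- The bias term `\(\beta_2\delta_1\)` is zero only if `\(\beta_2=0\)` (Condition 1 fails) or `\(\delta_1=0\)` (Condition 2 fails)
]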
---

# Marginal Effects I

`$$Y_i= \beta_0+\beta_1 X_{1i} + \beta_2 X_{2i}$$`

- Consider changing `\(X_1\)` by `\(\Delta X_1\)` while holding `\(X_2\)` constant:

`$$\begin{align*}
Y&= \beta_0+\beta_1 X_{1} + \beta_2 X_{2} && \text{Before the change}\\
\end{align*}$$`

---

# Marginal Effects I

`$$Y_i= \beta_0+\beta_1 X_{1i} + \beta_2 X_{2i}$$`

- Consider changing `\(X_1\)` by `\(\Delta X_1\)` while holding `\(X_2\)` constant:

`$$\begin{align*}
Y&= \beta_0+\beta_1 X_{1} + \beta_2 X_{2} && \text{Before the change}\\
Y+\Delta Y&= \beta_0+\beta_1 (X_{1}+\Delta X_1) + \beta_2 X_{2} && \text{After the change}\\
\end{align*}$$`

---

# Marginal Effects I

`$$Y_i= \beta_0+\beta_1 X_{1i} + \beta_2 X_{2i}$$`

- Consider changing `\(X_1\)` by `\(\Delta X_1\)` while holding `\(X_2\)` constant:

`$$\begin{align*}
Y&= \beta_0+\beta_1 X_{1} + \beta_2 X_{2} && \text{Before the change}\\
Y+\Delta Y&= \beta_0+\beta_1 (X_{1}+\Delta X_1) + \beta_2 X_{2} && \text{After the change}\\
\Delta Y&= \beta_1 \Delta X_1 && \text{The difference}\\
\end{align*}$$`

---

# Marginal Effects I

`$$Y_i= \beta_0+\beta_1 X_{1i} + \beta_2 X_{2i}$$`

- Consider changing `\(X_1\)` by `\(\Delta X_1\)` while holding `\(X_2\)` constant:

`$$\begin{align*}
Y&= \beta_0+\beta_1 X_{1} + \beta_2 X_{2} && \text{Before the change}\\
Y+\Delta Y&= \beta_0+\beta_1 (X_{1}+\Delta X_1) + \beta_2 X_{2} && \text{After the change}\\
\Delta Y&= \beta_1 \Delta X_1 && \text{The difference}\\
\frac{\Delta Y}{\Delta X_1} &= \beta_1 && \text{Solving for } \beta_1\\
\end{align*}$$`

---

# Marginal Effects II

`$$\beta_1 =\frac{\Delta Y}{\Delta X_1}\text{ holding } X_2 \text{ constant}$$`

--

Similarly, for `\(\beta_2\)`:

`$$\beta_2 =\frac{\Delta Y}{\Delta X_2}\text{ holding }X_1 \text{ constant}$$`

--

And for the constant, `\(\beta_0\)`:

`$$\beta_0 =\text{predicted value of Y when } X_1=0, \; X_2=0$$`

---

# You Can Keep Your Intuitions...But They're Wrong Now

.pull-left[

- We have been envisioning OLS regressions as the equation of a line through a scatterplot of data on two variables, `\(X\)` and `\(Y\)`
  - `\(\beta_0\)`: "intercept"
  - `\(\beta_1\)`: "slope"

- With 3+ variables, OLS regression is no longer a "line" for us to estimate

]

.pull-right[
]

---

# The "Constant"

- Alternatively, we can write the population regression equation as:

`$$Y_i=\beta_0\color{#e64173}{X_{0i}}+\beta_1X_{1i}+\beta_2X_{2i}+u_i$$`

- Here, we multiplied `\(\beta_0\)` by a new regressor, `\(X_{0i}\)`

- `\(X_{0i}\)` is a .hi[constant regressor], as we define `\(X_{0i}=1\)` for all `\(i\)` observations

- Likewise, `\(\beta_0\)` is more generally called the .hi[“constant”] term in the regression (instead of the “intercept”)

- This may seem silly and trivial, but this will be useful next class!

---

# The Population Regression Model: Example I

.content-box-green[
.green[**Example**:]
.smaller[
`$$\text{Beer Consumption}_i=\beta_0+\beta_1Price_i+\beta_2Income_i+\beta_3\text{Nachos Price}_i+\beta_4\text{Wine Price}_i+u_i$$`
]
]

- Let's see what you remember from micro(econ)!

--

- What measures the **price effect**? What sign should it have?

--

- What measures the **income effect**? What sign should it have? What should inferior or normal (necessities & luxury) goods look like?

--

- What measures the **cross-price effect(s)**? What sign should substitutes and complements have?

---

# The Population Regression Model: Example I

.content-box-green[
.green[**Example**:]
.smaller[
`$$\widehat{\text{Beer Consumption}_i}=20-1.5Price_i+1.25Income_i-0.75\text{Nachos Price}_i+1.3\text{Wine Price}_i$$`
]
]

- Interpret each `\(\hat{\beta}\)`

---

# Multivariate OLS in R

.left-code[
.code60[
```r
# run regression of testscr on str and el_pct
school_reg_2 <- lm(testscr ~ str + el_pct,
                   data = CASchool)
```
]
]

.right-plot[
.smaller[
- Format for regression is `lm(y ~ x1 + x2, data = df)`

- `y` is dependent variable (listed first!)
- `~` means “modeled by”

- `x1` and `x2` are the independent variables

- `df` is the dataframe where the data is stored
]
]

---

# Multivariate OLS in R II

.left-code[
.code60[
```r
# look at reg object
school_reg_2
```

```
## 
## Call:
## lm(formula = testscr ~ str + el_pct, data = CASchool)
## 
## Coefficients:
## (Intercept)          str       el_pct  
##    686.0322      -1.1013      -0.6498
```
]
]

.right-plot[
- Stored as an `lm` object called `school_reg_2`, a `list` object
]

---

# Multivariate OLS in R III

.code50[
```r
summary(school_reg_2) # get full summary
```

```
## 
## Call:
## lm(formula = testscr ~ str + el_pct, data = CASchool)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.845 -10.240  -0.308   9.815  43.461 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 686.03225    7.41131  92.566  < 2e-16 ***
## str          -1.10130    0.38028  -2.896  0.00398 ** 
## el_pct       -0.64978    0.03934 -16.516  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.46 on 417 degrees of freedom
## Multiple R-squared:  0.4264, Adjusted R-squared:  0.4237 
## F-statistic:   155 on 2 and 417 DF,  p-value: < 2.2e-16
```
]

---

# Multivariate OLS in R IV: broom

.left-column[
.center[

]
]

.right-column[
.smaller[
.code50[
```r
# load packages
library(broom)

# tidy regression output
tidy(school_reg_2)
```

| term        |    estimate |  std.error |  statistic |       p.value |
|:------------|------------:|-----------:|-----------:|--------------:|
| (Intercept) | 686.0322487 | 7.41131248 |  92.565554 | 3.871501e-280 |
| str         |  -1.1012959 | 0.38027832 |  -2.896026 |  3.978056e-03 |
| el_pct      |  -0.6497768 | 0.03934255 | -16.515879 |  1.657506e-47 |

]
]
]

---

# Multivariate Regression Output Table

.pull-left[
.code50[
```r
library(huxtable)
# school_reg is not defined in these slides; it is assumed to be the
# earlier simple regression of testscr on str
huxreg("Model 1" = school_reg,
       "Model 2" = school_reg_2,
       coefs = c("Intercept" = "(Intercept)",
                 "Class Size" = "str",
                 "%ESL Students" = "el_pct"),
       statistics = c("N" = "nobs",
                      "R-Squared" = "r.squared",
                      "SER" = "sigma"),
       number_format = 2)
```
]
]

.pull-right[
.quitesmall[
|               |    Model 1 |    Model 2 |
|:--------------|-----------:|-----------:|
| Intercept     | 698.93 *** | 686.03 *** |
| Class Size    |  -2.28 *** |   -1.10 ** |
| %ESL Students |            |  -0.65 *** |

*** p < 0.001; ** p < 0.01; * p < 0.05.
] ]
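---

# Omitted Variable Bias: A Simulated Sketch

.smallest[
The comparison of Model 1 and Model 2 can be mimicked with made-up data, where the true coefficients are known. This is a minimal sketch, not part of the original analysis: the variable names (`x`, `z`, `y`), the sample size, and every coefficient below are assumptions chosen for illustration. It checks that the slope from the "short" regression equals the slope from the "long" regression plus the coefficient on the omitted variable times the slope from regressing the omitted variable on `x`.
]

.code50[
```r
# Simulated illustration of omitted variable bias (all values are made up)
set.seed(480)
n <- 1000

x <- rnorm(n, mean = 20, sd = 2)  # regressor, loosely playing the role of STR
z <- 2 * x + rnorm(n, sd = 10)    # omitted variable, positively correlated with x
u <- rnorm(n, sd = 5)             # remaining error, independent of x and z
y <- 700 - 1 * x - 0.5 * z + u    # assumed true model: beta_1 = -1, beta_2 = -0.5

short_reg <- lm(y ~ x)            # omits z
long_reg  <- lm(y ~ x + z)        # controls for z
aux_reg   <- lm(z ~ x)            # how z moves with x

b1_short <- coef(short_reg)[["x"]]
b1_long  <- coef(long_reg)[["x"]]
b2_long  <- coef(long_reg)[["z"]]
d1       <- coef(aux_reg)[["x"]]

# Compare the two slopes on x and the implied bias term
c(short = b1_short, long = b1_long, implied_bias = b2_long * d1)

# The short slope equals the long slope plus the implied bias (up to rounding)
all.equal(b1_short, b1_long + b2_long * d1)
```
]

.smallest[
- Because `z` lowers `y` and rises with `x`, the short regression's slope on `x` is more negative than the long regression's, the same direction as the move from Model 2 to Model 1

- Applying the same identity to the rounded estimates in the table, `\(-2.28 = -1.10 + (-0.65)\,\hat{\delta_1}\)` would imply a slope of roughly `\(1.8\)` from regressing `el_pct` on `str`
]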