state <fctr> | year <fctr> | deaths <dbl> |
---|---|---|
Alabama | 2012 | 13.316056 |
Alaska | 2012 | 12.311976 |
Arizona | 2012 | 13.720419 |
Arkansas | 2012 | 16.466730 |
California | 2012 | 8.756507 |
Colorado | 2012 | 10.092204 |
state <fctr> | year <fctr> | deaths <dbl> |
---|---|---|
Maryland | 2007 | 10.866679 |
Maryland | 2008 | 10.740963 |
Maryland | 2009 | 9.892754 |
Maryland | 2010 | 8.783883 |
Maryland | 2011 | 8.626745 |
Maryland | 2012 | 8.941916 |
state <fctr> | year <fctr> | deaths <dbl> |
---|---|---|
Alabama | 2007 | 18.075232 |
Alabama | 2008 | 16.289227 |
Alabama | 2009 | 13.833678 |
Alabama | 2010 | 13.434084 |
Alabama | 2011 | 13.771989 |
Alabama | 2012 | 13.316056 |
Alaska | 2007 | 16.301184 |
Alaska | 2008 | 12.744090 |
Alaska | 2009 | 12.973849 |
Alaska | 2010 | 11.670893 |
Panel or Longitudinal data contains repeated observations over time (t) of multiple individuals or groups (i)
Thus, our regression equation looks like:
$$\hat{Y}_{it}=\beta_0+\beta_1 X_{it}+u_{it}$$
for individual $i$ in time $t$.
Example: Do cell phones cause more traffic fatalities?
No measure of cell phones used while driving
cell_plans as a proxy for cell phone usage
State-level data over 6 years
glimpse(phones)
## Rows: 306
## Columns: 8
## $ year <fct> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2…
## $ state <fct> Alabama, Alaska, Arizona, Arkansas, California, Colorad…
## $ urban_percent <dbl> 30, 55, 45, 21, 54, 34, 84, 31, 100, 53, 39, 45, 11, 56…
## $ cell_plans <dbl> 8135.525, 6730.282, 7572.465, 8071.125, 8821.933, 8162.…
## $ cell_ban <fct> 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ text_ban <fct> 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ deaths <dbl> 18.075232, 16.301184, 16.930578, 19.595430, 12.104340, …
## $ year_num <dbl> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2…
phones %>% count(year)
year <fctr> | n <int> |
---|---|
2007 | 51 |
2008 | 51 |
2009 | 51 |
2010 | 51 |
2011 | 51 |
2012 | 51 |
phones %>% summarize(States = n_distinct(state), Years = n_distinct(year))
States <int> | Years <int> |
---|---|
51 | 6 |
# install.packages("plm")
library(plm)
pdim(phones, index = c("state", "year"))
## Balanced Panel: n = 51, T = 6, N = 306
plm package for panel data in R
pdim() checks the dimensions of a panel dataset
index= takes a vector of the "group" & "year" variable names
Returns a summary of:
n groups
T periods
N total observations
$$\hat{Y}_{it}=\beta_0+\beta_1 X_{it}+u_{it}$$
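What "balanced" means in the pdim() output can be checked by hand. The sketch below uses base R only and a small made-up panel (the data frame `panel` and its values are hypothetical, not the phones data): a panel is balanced when every group is observed exactly once in every period, so that N = n × T.

```r
# Toy long-format panel (hypothetical values, for illustration only)
panel <- expand.grid(state = c("A", "B", "C"), year = 2007:2009)
panel$deaths <- seq_len(nrow(panel))

# Count the observations in every state-year cell
cell_counts <- table(panel$state, panel$year)

n_groups  <- length(unique(panel$state))  # n: number of groups
n_periods <- length(unique(panel$year))   # T: number of time periods
n_obs     <- nrow(panel)                  # N: total observations

# Balanced: each group appears exactly once in each period
is_balanced <- all(cell_counts == 1)

cat("n =", n_groups, " T =", n_periods, " N =", n_obs,
    " balanced:", is_balanced, "\n")
```

If any state-year cell were missing or duplicated, `is_balanced` would be FALSE and N would no longer equal n × T.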
pooled <- lm(deaths ~ cell_plans, data = phones)
pooled %>% tidy()
term <chr> | estimate <dbl> | std.error <dbl> | statistic <dbl> | p.value <dbl> |
---|---|---|---|---|
(Intercept) | 17.3371034167 | 0.975384504 | 17.774635 | 5.821724e-49 |
cell_plans | -0.0005666385 | 0.000106975 | -5.296926 | 2.264086e-07 |
ggplot(data = phones)+
  aes(x = cell_plans, y = deaths)+
  geom_point()+
  labs(x = "Cell Phones Per 10,000 People",
       y = "Deaths Per Billion Miles Driven")+
  theme_bw(base_family = "Fira Sans Condensed", base_size=14)

ggplot(data = phones)+
  aes(x = cell_plans, y = deaths)+
  geom_point()+
  geom_smooth(method = "lm", color = "red")+
  labs(x = "Cell Phones Per 10,000 People",
       y = "Deaths Per Billion Miles Driven")+
  theme_bw(base_family = "Fira Sans Condensed", base_size=14)
1) The expected value of the residuals is 0: $E[u]=0$
2) The variance of the residuals over $X$ is constant: $var(u|X)=\sigma^2_u$
3) Errors are not correlated across observations: $cor(u_i,u_j)=0 \; \forall \, i \neq j$
4) There is no correlation between $X$ and the error term: $cor(X,u)=0$ or $E[u|X]=0$
$$\hat{Y}_{it}=\beta_0+\beta_1 X_{it}+u_{it}$$
Assumption 3: $cor(u_i,u_j)=0 \; \forall \, i \neq j$
$$\widehat{\text{Deaths}}_{it}=\beta_0+\beta_1 \, \text{Cell Phones}_{it}+u_{it}$$
Pooled regression model is biased because it ignores:
Multiple observations from same state $i$
Multiple observations from same year $t$
Thus, errors are serially or auto-correlated; $cor(u_i,u_j) \neq 0$ within same $i$ and within same $t$
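The dependence among errors from the same group can be seen in a small simulation. This is only a sketch on made-up data (all names and parameter values below are assumptions, not the phones data): each group gets a time-invariant component that the pooled regression leaves in the error term, so consecutive pooled residuals from the same group end up strongly correlated.

```r
set.seed(1)
n_groups  <- 50
n_periods <- 6
id <- rep(seq_len(n_groups), each = n_periods)  # group index, sorted by group

alpha <- rnorm(n_groups, sd = 3)        # time-invariant group factor (assumed)
x <- rnorm(n_groups * n_periods)
y <- 1 + 0.5 * x + alpha[id] + rnorm(n_groups * n_periods)

# Pooled regression ignores the group structure entirely
pooled_fit <- lm(y ~ x)
u_hat <- resid(pooled_fit)

# Pair each residual with the previous period's residual from the same group:
# duplicated(id) is TRUE for every row except the first row of each group
idx   <- which(duplicated(id))
u_now <- u_hat[idx]
u_lag <- u_hat[idx - 1]

within_cor <- cor(u_now, u_lag)
within_cor  # well above 0: Assumption 3 fails within groups
```

If Assumption 3 held, this correlation would be close to zero.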
phones %>%
  filter(state %in% c("District of Columbia", "Maryland", "Texas", "California", "Kansas")) %>%
  ggplot(data = .)+
  aes(x = cell_plans, y = deaths, color = state)+
  geom_point()+
  geom_smooth(method = "lm")+
  labs(x = "Cell Phones Per 10,000 People",
       y = "Deaths Per Billion Miles Driven",
       color = NULL)+
  theme_bw(base_family = "Fira Sans Condensed", base_size=14)+
  theme(legend.position = "top")

phones %>%
  filter(state %in% c("District of Columbia", "Maryland", "Texas", "California", "Kansas")) %>%
  ggplot(data = .)+
  aes(x = cell_plans, y = deaths, color = state)+
  geom_point()+
  geom_smooth(method = "lm")+
  labs(x = "Cell Phones Per 10,000 People",
       y = "Deaths Per Billion Miles Driven",
       color = NULL)+
  theme_bw(base_family = "Fira Sans Condensed", base_size=14)+
  theme(legend.position = "none")+
  facet_wrap(~state, ncol=3)

ggplot(data = phones)+
  aes(x = cell_plans, y = deaths, color = state)+
  geom_point()+
  geom_smooth(method = "lm")+
  labs(x = "Cell Phones Per 10,000 People",
       y = "Deaths Per Billion Miles Driven",
       color = NULL)+
  theme_bw(base_family = "Fira Sans Condensed")+
  theme(legend.position = "none")+
  facet_wrap(~state, ncol=7)
$$\widehat{\text{Deaths}}_{it}=\beta_0+\beta_1 \, \text{Cell Phones}_{it}+u_{it}$$
$cor(u_{it}, \text{cell phones}_{it}) \neq 0$
$E[u_{it} \mid \text{cell phones}_{it}] \neq 0$
A simple pooled model likely contains lots of omitted variable bias
Many (often unobservable) factors that determine both Phones & Deaths
But the beauty of this is that most of these factors systematically vary by U.S. State and are stable over time!
We can simply “control for State” to safely remove the influence of all of these factors!
Much of the endogeneity in $X_{it}$ can be explained by systematic differences across $i$ (groups)
Exploit the systematic variation across groups with a fixed effects model
Decompose the model error term into two parts:
$$u_{it}=\alpha_i+\epsilon_{it}$$
$\alpha_i$ are group-specific fixed effects
This includes all factors that do not change within group $i$ over time
$\epsilon_{it}$ is the remaining random error
$\epsilon_{it}$ includes all other factors affecting $Y_{it}$ not contained in group effect $\alpha_i$
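This decomposition makes the pooled model's problem explicit. As a sketch, assume the remaining error is unrelated to the regressor, $cov(X_{it},\epsilon_{it})=0$ (an assumption made here for illustration). Then the standard omitted variable bias result says the pooled slope converges to the true slope plus a term driven by how the group effects co-vary with $X_{it}$:

```latex
\hat{\beta}_1^{\,pooled} \;\xrightarrow{\;p\;}\;
\beta_1 + \frac{cov(X_{it},\, u_{it})}{var(X_{it})}
= \beta_1 + \frac{cov(X_{it},\, \alpha_i + \epsilon_{it})}{var(X_{it})}
= \beta_1 + \frac{cov(X_{it},\, \alpha_i)}{var(X_{it})}
```

The bias vanishes only if $cov(X_{it},\alpha_i)=0$, and its sign follows the sign of that covariance.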
$$\hat{Y}_{it}=\beta_0+\beta_1 X_{it}+\alpha_i+\epsilon_{it}$$
We've pulled $\alpha_i$ out of the original error term into the regression
Essentially we’ll estimate an intercept for each group (minus one, which is $\beta_0$)
Must have multiple observations (over time) for each group (i.e. panel data)
$$\widehat{\text{Deaths}}_{it}=\beta_0+\beta_1 \, \text{Cell phones}_{it}+\alpha_i+\epsilon_{it}$$
$\alpha_i$ is the State fixed effect
There could still be factors in $\epsilon_{it}$ that are correlated with $\text{Cell phones}_{it}$!
$$\hat{Y}_{it}=\beta_0+\beta_1 X_{it}+\alpha_i+\epsilon_{it}$$
Two ways to estimate a fixed effects model:
1) Least Squares Dummy Variable (LSDV) approach
2) De-meaned data approach
$$\hat{Y}_{it}=\beta_0+\beta_1 X_{it}+\beta_2 D_{1i}+\beta_3 D_{2i}+\cdots+\beta_N D_{(N-1)i}+\epsilon_{it}$$
Example:
$$\widehat{\text{Deaths}}_{it}=\beta_0+\beta_1 \, \text{Cell Phones}_{it}+\text{Alaska}_i+\cdots+\text{Wyoming}_i$$
If state is a factor variable, just include it in the regression
R automatically creates $N-1$ dummy variables and includes them in the regression
fe_reg_1 <- lm(deaths ~ cell_plans + state, data = phones)
fe_reg_1 %>% tidy()
term <chr> | estimate <dbl> | std.error <dbl> | statistic <dbl> | p.value <dbl> |
---|---|---|---|---|
(Intercept) | 25.507679925 | 1.0176400289 | 25.06552337 | 1.241581e-70 |
cell_plans | -0.001203742 | 0.0001013125 | -11.88147584 | 3.483442e-26 |
stateAlaska | -2.484164783 | 0.6745076282 | -3.68293060 | 2.816972e-04 |
stateArizona | -1.510577383 | 0.6704569688 | -2.25305643 | 2.510925e-02 |
stateArkansas | 3.192662931 | 0.6664383936 | 4.79063476 | 2.829319e-06 |
stateCalifornia | -4.978668651 | 0.6655467951 | -7.48056889 | 1.206933e-12 |
stateColorado | -4.344553493 | 0.6654735335 | -6.52851432 | 3.588784e-10 |
stateConnecticut | -6.595185530 | 0.6654428902 | -9.91097152 | 8.698802e-20 |
stateDelaware | -2.098393628 | 0.6666483193 | -3.14767707 | 1.842218e-03 |
stateDistrict of Columbia | 6.355790010 | 1.2897172620 | 4.92804911 | 1.499627e-06 |
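Why the group dummies change the slope so much can be illustrated on simulated data, where the true slope is known. This is a sketch only: the data, the true slope of -1, and every variable name below are assumptions for illustration, not the phones data. The regressor is built to be correlated with the group effect, so the pooled slope is pulled away from the truth while the LSDV slope recovers it.

```r
set.seed(42)
n_groups  <- 40
n_periods <- 6
id <- rep(seq_len(n_groups), each = n_periods)

alpha <- rnorm(n_groups, sd = 2)  # unobserved, time-invariant group effect
# x is correlated with the group effect: the source of endogeneity
x <- 2 * alpha[id] + rnorm(n_groups * n_periods)
# True slope on x is -1 (assumed for this simulation)
y <- 5 - 1 * x + 3 * alpha[id] + rnorm(n_groups * n_periods)

pooled_slope <- unname(coef(lm(y ~ x))["x"])

# LSDV: factor(id) expands into one dummy per group, minus a reference group
lsdv_fit   <- lm(y ~ x + factor(id))
lsdv_slope <- unname(coef(lsdv_fit)["x"])
n_dummies  <- sum(grepl("^factor\\(id\\)", names(coef(lsdv_fit))))

c(pooled = pooled_slope, lsdv = lsdv_slope, dummies = n_dummies)
```

The count of dummy coefficients is one fewer than the number of groups, because the omitted group is absorbed into the intercept.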
Alternatively, we can control our regression for group fixed effects without directly estimating them
We simply de-mean the data for each group
For each group $i$, find the means (over time, $t$):
$$\bar{Y}_i=\beta_0+\beta_1 \bar{X}_i+\bar{\alpha}_i+\bar{\epsilon}_i$$
Subtract the group means from each observation; because $\alpha_i$ does not change over time, $\bar{\alpha}_i=\alpha_i$, so both $\beta_0$ and $\alpha_i$ cancel:
$$\begin{aligned}
Y_{it} &= \beta_0+\beta_1 X_{it}+\alpha_i+\epsilon_{it} \\
\bar{Y}_i &= \beta_0+\beta_1 \bar{X}_i+\bar{\alpha}_i+\bar{\epsilon}_i \\
Y_{it}-\bar{Y}_i &= \beta_1 (X_{it}-\bar{X}_i)+\tilde{\epsilon}_{it} \\
\tilde{Y}_{it} &= \beta_1 \tilde{X}_{it}+\tilde{\epsilon}_{it}
\end{aligned}$$
Within each group $i$, the de-meaned variables $\tilde{Y}_{it}$ and $\tilde{X}_{it}$ all have a mean of 0†
Variables that don't change over time will drop out of analysis altogether
Removes any source of variation across groups to only work with variation within each group
† Recall Rule 4 from the 2.3 class notes on the Summation Operator: $\sum (X_i-\bar{X})=0$
$$\tilde{Y}_{it}=\beta_1 \tilde{X}_{it}+\tilde{\epsilon}_{it}$$
Yields identical results to dummy variable approach
More useful when we have many groups (would be many dummies)
Demonstrates intuition behind fixed effects:
We are basically comparing groups to themselves over time
Ignore all differences between groups, only look at differences within groups over time
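The equivalence between de-meaning and the dummy variable approach can be verified by hand. The sketch below uses base R and simulated data (all names and values are assumptions for illustration): it subtracts each group's mean with `ave()`, regresses the de-meaned outcome on the de-meaned regressor without an intercept, and compares the slope with the LSDV slope.

```r
set.seed(7)
n_groups  <- 30
n_periods <- 5
id <- rep(seq_len(n_groups), each = n_periods)

alpha <- rnorm(n_groups, sd = 2)
x <- alpha[id] + rnorm(n_groups * n_periods)
y <- 2 + 0.5 * x + alpha[id] + rnorm(n_groups * n_periods)

# ave() returns each row's group mean (its default FUN is mean)
x_tilde <- x - ave(x, id)
y_tilde <- y - ave(y, id)

# Within estimator: no intercept, since de-meaned variables have mean 0
within_slope <- unname(coef(lm(y_tilde ~ 0 + x_tilde)))

# LSDV estimator on the original data
lsdv_slope <- unname(coef(lm(y ~ x + factor(id)))["x"])

all.equal(within_slope, lsdv_slope)
```

Only the slopes are compared here. The standard errors that lm() reports for the de-meaned regression do not account for the group means having been estimated, so they should not be read as the fixed effects standard errors.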
# get means of Y and X by state
means_state <- phones %>%
  group_by(state) %>%
  summarize(avg_deaths = mean(deaths),
            avg_phones = mean(cell_plans))

# look at it
means_state
state <fctr> | avg_deaths <dbl> | avg_phones <dbl> |
---|---|---|
Alabama | 14.786711 | 8906.370 |
Alaska | 13.612953 | 7817.759 |
Arizona | 14.249825 | 8097.482 |
Arkansas | 17.543881 | 9268.153 |
California | 9.659712 | 9029.594 |
Colorado | 10.351405 | 8981.762 |
Connecticut | 8.141739 | 8947.729 |
Delaware | 12.209610 | 9304.052 |
District of Columbia | 8.015895 | 19811.205 |
Florida | 13.544635 | 9078.592 |
ggplot(data = means_state)+
  aes(x = fct_reorder(state, avg_deaths), y = avg_deaths, color = state)+
  geom_point()+
  geom_segment(aes(y = 0, yend = avg_deaths, x = state, xend = state))+
  coord_flip()+
  # x is mapped to state (drawn on the vertical axis after coord_flip())
  labs(x = "State",
       y = "Deaths Per Billion Miles Driven",
       color = NULL)+
  theme_bw(base_family = "Fira Sans Condensed", base_size=10)+
  theme(legend.position = "none")
The plm package is designed for panel data
plm() function is just like lm(), with some additional arguments:
index="group_variable_name" set equal to the name of your factor variable for the groups
model= set equal to "within" to use fixed-effects (within-estimator)
#install.packages("plm")
library(plm)
fe_reg_1_alt <- plm(deaths ~ cell_plans,
                    data = phones,
                    index = "state",
                    model = "within")
fe_reg_1_alt %>% tidy()
term <chr> | estimate <dbl> | std.error <dbl> | statistic <dbl> | p.value <dbl> |
---|---|---|---|---|
cell_plans | -0.001203742 | 0.0001013125 | -11.88148 | 3.483442e-26 |
State fixed effect controls for all factors that vary by state but are stable over time
But there are still other (often unobservable) factors that affect both Phones and Deaths, that don’t vary by State
If these factors systematically vary over time, but are the same by State, then we can “control for Year” to safely remove the influence of all of these factors!
A one-way fixed effects model estimates a fixed effect for groups
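Two-way fixed effects model estimates fixed effects for both groups and time periods
$$\hat{Y}_{it}=\beta_0+\beta_1 X_{it}+\alpha_i+\theta_t+\nu_{it}$$
$\alpha_i$: group fixed effects
$\theta_t$: time fixed effects
$\nu_{it}$: remaining random error
$$\widehat{\text{Deaths}}_{it}=\beta_0+\beta_1 \, \text{Cell phones}_{it}+\alpha_i+\theta_t+\nu_{it}$$
$\alpha_i$: State fixed effects
$\theta_t$: Year fixed effects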
# find averages for years
means_year <- phones %>%
  group_by(year) %>%
  summarize(avg_deaths = mean(deaths),
            avg_phones = mean(cell_plans))

means_year
year <fctr> | avg_deaths <dbl> | avg_phones <dbl> |
---|---|---|
2007 | 14.00751 | 8064.531 |
2008 | 12.87156 | 8482.903 |
2009 | 12.08632 | 8859.706 |
2010 | 11.61487 | 9134.592 |
2011 | 11.36431 | 9485.238 |
2012 | 11.65666 | 9660.474 |
ggplot(data = phones)+
  aes(x = year, y = deaths)+
  geom_point(aes(color = year))+
  # Add the yearly means as black points
  geom_point(data = means_year, aes(x = year, y = avg_deaths), size = 3, color = "black")+
  # group = 1 joins the yearly means into a single line, since year is a factor
  geom_path(data = means_year, aes(x = year, y = avg_deaths, group = 1), size = 1)+
  theme_bw(base_family = "Fira Sans Condensed", base_size = 14)+
  theme(legend.position = "none")
$$\hat{Y}_{it}=\beta_0+\beta_1 X_{it}+\alpha_i+\theta_t+\nu_{it}$$
1) Least Squares Dummy Variable (LSDV) Approach: add dummies for both groups and time periods (separate intercepts for groups and times)
2) Fully De-meaned data:
$$\tilde{Y}_{it}=\beta_1 \tilde{X}_{it}+\tilde{\nu}_{it}$$
where for each variable: $\widetilde{var}_{it}=var_{it}-\overline{var}_t-\overline{var}_i+\overline{var}$ (in a balanced panel, adding back the overall mean $\overline{var}$ makes the constant terms cancel)
3) Hybrid: de-mean for one effect (groups or years) and add dummies for the other effect (years or groups)
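The three approaches can be compared on a simulated balanced panel. This is a sketch under stated assumptions: the data and names are made up, the panel is balanced, and the two-way de-meaning adds the overall mean back after subtracting group and period means so that the transformed variables have mean zero.

```r
set.seed(11)
n_groups  <- 20
n_periods <- 6
id     <- rep(seq_len(n_groups), each = n_periods)
period <- rep(seq_len(n_periods), times = n_groups)

alpha <- rnorm(n_groups, sd = 2)   # group effects
theta <- rnorm(n_periods, sd = 2)  # time effects
x <- alpha[id] + theta[period] + rnorm(n_groups * n_periods)
y <- 1 + 0.5 * x + alpha[id] + theta[period] + rnorm(n_groups * n_periods)

# 1) LSDV: dummies for both groups and periods
lsdv_slope <- unname(coef(lm(y ~ x + factor(id) + factor(period)))["x"])

# 2) Fully de-meaned (balanced panel): subtract group and period means,
#    then add the overall mean back
two_way_demean <- function(v) v - ave(v, id) - ave(v, period) + mean(v)
x_tilde <- two_way_demean(x)
y_tilde <- two_way_demean(y)
demeaned_slope <- unname(coef(lm(y_tilde ~ 0 + x_tilde)))

# 3) Hybrid: de-mean by group, add dummies for periods
x_group <- x - ave(x, id)
y_group <- y - ave(y, id)
hybrid_slope <- unname(coef(lm(y_group ~ x_group + factor(period)))["x_group"])

c(lsdv = lsdv_slope, demeaned = demeaned_slope, hybrid = hybrid_slope)
```

All three slopes agree up to floating point error in this balanced setting. With an unbalanced panel the simple subtraction in `two_way_demean()` would no longer reproduce the LSDV estimate.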
fe2_reg_1 <- lm(deaths ~ cell_plans + state + year, data = phones)
fe2_reg_1 %>% tidy()
term <chr> | estimate <dbl> | std.error <dbl> | statistic <dbl> | p.value <dbl> |
---|---|---|---|---|
(Intercept) | 18.9304707399 | 1.4511323962 | 13.0453092 | 5.427406e-30 |
cell_plans | -0.0002995294 | 0.0001723149 | -1.7382677 | 8.339982e-02 |
stateAlaska | -1.4998292482 | 0.6241082951 | -2.4031554 | 1.698648e-02 |
stateArizona | -0.7791714713 | 0.6113519094 | -1.2745057 | 2.036724e-01 |
stateArkansas | 2.8655344756 | 0.5985062952 | 4.7878101 | 2.895040e-06 |
stateCalifornia | -5.0900897113 | 0.5956293282 | -8.5457338 | 1.299236e-15 |
stateColorado | -4.4127241692 | 0.5953924847 | -7.4114543 | 1.945083e-12 |
stateConnecticut | -6.6325834801 | 0.5952933996 | -11.1417051 | 1.169797e-23 |
stateDelaware | -2.4579829953 | 0.5991822226 | -4.1022295 | 5.546475e-05 |
stateDistrict of Columbia | -3.5044963616 | 1.9710939218 | -1.7779449 | 7.663326e-02 |
fe2_reg_2 <- plm(deaths ~ cell_plans,
                 index = c("state", "year"),
                 model = "within",
                 data = phones)
fe2_reg_2 %>% tidy()
term <chr> | estimate <dbl> | std.error <dbl> | statistic <dbl> | p.value <dbl> |
---|---|---|---|---|
cell_plans | -0.001203742 | 0.0001013125 | -11.88148 | 3.483442e-26 |
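plm() command takes the group and time identifiers in index=c("group", "time")
index= alone does not choose which effects are fit: with the default effect = "individual", only group effects are estimated, which is why the estimate above equals the one-way State fixed effects estimate
Set effect = "twoways" to fit both group and time fixed effects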
State fixed effect absorbs all unobserved factors that vary by state, but are constant over time
Year fixed effect absorbs all unobserved factors that vary by year, but are constant over States
But there are still other (often unobservable) factors that affect both Phones and Deaths, that vary by State and change over time!
We will also need to control for these variables (not picked up by fixed effects!)
$$\widehat{\text{Deaths}}_{it}=\beta_1 \, \text{Cell Phones}_{it}+\alpha_i+\theta_t+\text{urban pct}_{it}+\text{cell ban}_{it}+\text{text ban}_{it}$$
fe2_controls_reg <- plm(deaths ~ cell_plans + text_ban + urban_percent + cell_ban,
                        data = phones,
                        index = c("state", "year"),
                        model = "within",
                        effect = "twoways")
fe2_controls_reg %>% tidy()
term <chr> | estimate <dbl> | std.error <dbl> | statistic <dbl> | p.value <dbl> |
---|---|---|---|---|
cell_plans | -0.0003403735 | 0.0001729402 | -1.968157 | 0.05017303 |
text_ban1 | 0.2559261569 | 0.2221923049 | 1.151823 | 0.25051208 |
urban_percent | 0.0131347657 | 0.0111986138 | 1.172892 | 0.24197354 |
cell_ban1 | -0.6797956522 | 0.4029491232 | -1.687051 | 0.09286115 |
library(huxtable)
huxreg("Pooled" = pooled,
       "State Effects" = fe_reg_1,
       "State & Year Effects" = fe2_reg_1,
       "With Controls" = fe2_controls_reg,
       coefs = c("Intercept" = "(Intercept)",
                 "Cell phones" = "cell_plans",
                 "Cell Ban" = "cell_ban1",
                 "Texting Ban" = "text_ban1",
                 "Urbanization Rate" = "urban_percent"),
       statistics = c("N" = "nobs",
                      "R-Squared" = "r.squared",
                      "SER" = "sigma"),
       number_format = 4)
 | Pooled | State Effects | State & Year Effects | With Controls |
---|---|---|---|---|
Intercept | 17.3371 *** | 25.5077 *** | 18.9305 *** | |
 | (0.9754) | (1.0176) | (1.4511) | |
Cell phones | -0.0006 *** | -0.0012 *** | -0.0003 | -0.0003 |
 | (0.0001) | (0.0001) | (0.0002) | (0.0002) |
Cell Ban | | | | -0.6798 |
 | | | | (0.4029) |
Texting Ban | | | | 0.2559 |
 | | | | (0.2222) |
Urbanization Rate | | | | 0.0131 |
 | | | | (0.0112) |
N | 306 | 306 | 306 | 306 |
R-Squared | 0.0845 | 0.9055 | 0.9259 | 0.0329 |
SER | 3.2791 | 1.1526 | 1.0310 | |
*** p < 0.001; ** p < 0.01; * p < 0.05.
state <fctr> | year <fctr> | deaths <dbl> | cell_plans <dbl> |
---|---|---|---|
Alabama | 2012 | 13.316056 | 9433.800 |
Alaska | 2012 | 12.311976 | 8872.799 |
Arizona | 2012 | 13.720419 | 8810.889 |
Arkansas | 2012 | 16.466730 | 10047.027 |
California | 2012 | 8.756507 | 9362.424 |
Colorado | 2012 | 10.092204 | 9403.225 |