---
title: "2.5 — OLS: Precision and Diagnostics - R Practice"
author: "Solutions"
date: "ECON 480 — Econometrics — Fall 2020"
output:
html_document:
df_print: paged
#theme:
toc: true
toc_depth: 3
toc_float: true
code_folding: show
highlight: tango
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Question 1
Our [data](https://github.com/fivethirtyeight/data/tree/master/congress-trump-score) comes from fivethirtyeight's [Trump Congress tracker](https://projects.fivethirtyeight.com/congress-trump-score/). Download and read in (`read_csv`) the data.
- [ `congress-trump-score.csv`](/data/congress-trump-score.csv)
---
```{r}
library(tidyverse)
# again, my path (on website) may be different than yours
congress<-read_csv("../data/congress-trump-score.csv")
```
---
## Question 2
Look at the data with `glimpse()`.
---
```{r}
congress %>%
glimpse()
```
---
## Question 3
We want to see *how does the percentage that a member of Congress' agrees with President Trump (`agree_pct`) depend on the result of the 2016 Presidential election in their district (`net_trump_vote`)*? First, plot a scatterplot of `agree_pct` on `net_trump_vote`. Add a regression line with an additional layer of `geom_smooth(method="lm")`.
---
```{r, fig.retina=3}
scatter<-ggplot(data = congress)+
aes(x = net_trump_vote, y=agree_pct)+
geom_point()+
geom_smooth(method="lm")+
theme_light()
scatter
```
---
## Question 4
Find the correlation between `agree_pct` and `net_trump_vote`.
---
```{r}
# using base R
cor(congress$net_trump_vote, congress$agree_pct)
# using tidyverse
congress %>%
summarize(cor = cor(net_trump_vote, agree_pct))
```
---
## Question 5
We weant to predict the following model:
$$\widehat{\text{agree_pct}}= \hat{\beta_0}+\hat{\beta_1}\text{net_trump_vote}$$
Run a regression, and save it as an object. Now get a `summary()` of it.
---
```{r}
reg<-lm(agree_pct ~ net_trump_vote, data = congress)
summary(reg)
```
---
### Part A
What is $\hat{\beta_0}$? What does it mean in the context of our question?
---
```{r}
# I just want to show you how to extract these and put it into text with markdown
# This relies on the code in Question 6 Part B
library(broom)
reg_tidy<-reg %>%
tidy()
beta_0_hat<-reg_tidy %>% # save this as beta_0_hat
slice(1) %>% # look only at first row (intercept)
pull(estimate) %>% # extract the value of the estimate
round(.,2) # round to 2 decimal places
beta_0_hat # look at it
```
$\hat{\beta_0}=$ `r beta_0_hat`, meaning the rate at which Members of Congress agree with the President when the Net Trump Vote was 0 (i.e. both candidates got 50%) is `r beta_0_hat` (out of 1).^[Look at my markdown code to see that I can write my R code directly in my text with one backtick and r. This calls my `beta_0_hat` object I made.]
---
### Part B
What is $\hat{\beta_1}$? What does it mean in the context of our question?
---
```{r}
beta_1_hat<-reg_tidy %>% # save this as beta_0_hat
slice(2) %>% # look only at second row (net_trump_score)
pull(estimate) %>% # extract the value of the estimate
round(.,2) # round to 2 decimal places
beta_1_hat # look at it
```
$\hat{\beta_1}=$ `r beta_1_hat`, meaning for every additional Net Trump Vote % in a Member of Congress' district, that Member will agree `r beta_1_hat` more with the President (out of 1).
---
### Part C
What is $R^2$? What does it mean?
---
```{r}
# this uses the glance command in question 6 part C.
r_sq<-glance(reg)$r.squared %>%
round(.,2) # round to 2 decimals
r_sq # look at it
```
$R^2 =$ `r r_sq`, meaning we have explained `r r_sq*100`%^[Again, look at my markdown: I am calling `r_sq` and multiplying it by 100 inside the code chunk.] of the variation in `agree_pct`.
---
### Part D
What is the $SER$? What does it mean?
---
```{r}
# this uses the glance command in question 6 part C.
ser<-glance(reg)$sigma %>%
round(.,2) # round to 2 decimals
ser # look at it
```
$SER =$ `r ser`, meaning the average size of the residuals or errors (distance from an actual data point to the predicted regression line) is `r ser` (out of 1).
---
## Question 6
We can look at regression outputs in a tidier way, with the `broom` package.
### Part A
Install and then load `broom`.
---
```{r}
#install.packages("broom")
library(broom)
```
---
### Part B
Run the function `tidy()` on your regression object (saved in question 5). Save this as an object and then look at it.
---
```{r}
reg_tidy<-reg %>%
tidy()
reg_tidy
```
---
### Part C
Run the `glance()` function on your original regression object. What does it show you?
---
```{r}
reg %>%
glance()
```
It shows you summary statistics of the regression, particularly its goodness of fit.
- `r.squared` is the $R^2$ value
- `sigma` is the $SER \; (\hat{\sigma_u})$
That's all we need for now.
---
### Part D
Now run the `augment()` function on your original regression object, and save this as an object. Look at it. What does it show you?
---
```{r}
reg_aug<-reg %>%
augment()
reg_aug
```
It takes the original data points for `Y` and `X` and adds new variables with values for those observations:
- `.fitted` is $\hat{Y_i|X}$, the predicted value for a specific `net_trump_vote`
- `.resid` is $\hat{u_i}$, the residual or error for a specific `net_trump_vote`
$$\begin{align*}
\hat{u_i}&=Y_i-\hat{Y_i}\\
.resid &= agree\text{_}pct - .fitted\\
\end{align*}$$
That's all we need right now.
---
## Question 7
Now let's start looking at the residuals of the regression.
### Part A
Take the augmented regression object from Question 6-D and use it as the source of your data to create a histogram, where $x$ is `.resid`. Does it look roughly normal?
---
```{r, fig.retina=3}
ggplot(data = reg_aug)+
aes(x = .resid)+
geom_histogram(color="white")+
theme_classic()
```
It does look rougly normal.
---
### Part B
Take the augmented regression object and make a residual plot, which is a scatterplot where `x` is the normal `x` variable, and `y` is the `.resid`. Feel free to add a horizontal line at 0 with `geom_hline(yintercept=0)`.
---
```{r, fig.retina=3}
ggplot(data = reg_aug)+
aes(x = net_trump_vote, y=.resid)+
geom_point()+
geom_hline(yintercept = 0, color = "red")+
theme_classic()
```
---
## Question 8
Now let's try presenting your results in a regression table. Install and load `huxtable`, and run the `huxreg()` command. Your main input is your regression object you saved in Question 5. Feel free to customize the output of this table (see the slides).
---
```{r}
#install.packages("huxtable")
library(huxtable)
# basic
huxreg(reg)
# some customizations
huxreg("Agree with President (Proportion)" = reg,
coefs = c("Intercept" = "(Intercept)",
"Net Trump Vote" = "net_trump_vote"),
statistics = c("n" = "nobs",
"R-squared" = "r.squared",
"SER" = "sigma"),
number_format = 4)
```
---
## Question 9
Now let's check for heteroskedasticity.
### Part A
Looking at the scatterplot and residual plots in Questions 3 and 7B, do you think the errors are heteroskedastic or homoskedastic?
---
It does seem like it would be, looking at the residual plot, since there are clear downward-moving clusters of points that are at times above the line, then below the line, then above and below the line.
---
### Part B
Install and load the `lmtest` package and run `bptest` on your regression object. Was the data heteroskedastic or homoskedastic?
---
```{r}
#install.packages("lmtest")
library(lmtest)
bptest(reg)
```
Since the $p$-value was less than 0.05, we can reject the null hypothesis $H_0$ (that the errors are homoskedastic), and conclude that the errors are heteroskedastic. We will need to fix them if we want accurate standard errors.
---
### Part C
Now let's make some heteroskedasticity-robust standard errors. Install and load the `estimatr` package and use the `lm_robust()` command (instead of the `lm()` command) to run a new regression (and save it). Make sure you add `se_type="stata"` inside the command to calculate robust SEs. Look at it. What changes?
---
```{r}
#install.packages("estimatr")
library(estimatr)
reg_rse<-lm_robust(agree_pct ~ net_trump_vote, data = congress, se_type="stata")
reg_rse
```
The standard errors on both estimates $(\hat{\beta_0}$ and $\hat{\beta_1})$ change, but the estimates themselves do not change!
Note `t values` and `Pr(>|t|)` will change, and this command adds confidence intervals (`CI`) and degrees of freedom for $t$, `DF`, but we have not covered any of these yet!
---
### Part D
Now let's see this in a nice regression table. Use `huxreg()` again, but add both your original regression and your regression saved in part C. Notice any changes?
---
```{r}
huxreg("Original" = reg, "Robust SEs" = reg_rse,
number_format = 6) # need to see more decimal places!
```
Note that the standard errors are too small to register anything within 3 decimal places (the default). If we were to make a true percent variable out of 100% instead of out of 1 (as I have you do in Question 13), we would see more of a difference. Look carefully, the standard error on `net_trump_vote` is 0.000192 (from Question 5), and the robust standard error is 0.000149 from Part C.
---
## Question 10
Now let's check for outliers.
### Part A
Just looking at the scatterplot in Question 3, do you see any outliers?
---
It looks like there might be one - for the Member whose district's Net vote was somewhere around $-60$ and s/he agrees with the President about $0.6$.
---
### Part B
Install and load the `car` package. Run the `outlierTest` command on your regression object. Does it detect any outliers?
---
```{r}
#install.packages("car")
library(car)
outlierTest(reg)
```
Yes, surprisingly it's not the point that looks like it stands out on the graph (middle-left), it's a different point.
---
### Part C
Look in your original data to match this outlier with an observation. Hint: use the `slice()` command, as the outlier test gave you an observation (row) number!
---
```{r}
congress %>%
slice(1708)
```
---
## Question 11 (Optional)
This data is still a bit messy. Let's check in on your `tidyverse` skills again! For example, we'd probably like to plot our scatterplots with colors for Republican and Democratic party. Or plot by the House and the Senate.
### Part A
First, the variable `congress` (session of Congress) seems a bit off. Get a `count()` of `congress`.
---
```{r}
congress %>%
count(congress)
```
---
### Part B
Let's get rid of the `0` values for `congress` (someone made a mistake coding this, probably). Also, while we're at it, let's take `agree_pct` and `mutate` a variable that is a proper percentage (i.e. `*100`).
---
```{r}
congress_tidy<-congress %>%
filter(congress!=0) %>%
mutate(agree_pct = agree_pct * 100)
```
---
### Part C
The variable `party` is also quite a mess. `count()` by `party` to see. Then let's `mutate` a variable to make a better measure of political party - just `"Republican"`, `"Democrat"`, and `"Independent"`. Try doing this with the `case_when()` command (as your `mutate` formula).^[The syntax for `case_when()` is to have a series of `condition ~ "Outcome"`, separated by commas. For example, one condition is to assign both `"Democrat"` and `"D"` to `"Democrat"`, as in `party %in% c("Democrat", "D") ~ "Democrat"`. You could also do this with a few `ifelse()` commands, but that's a bit more awkward.] When you're done `count()` by your new party variable to make sure it worked.
---
```{r}
congress_tidy <- congress_tidy %>%
mutate(pol_party = case_when(
party %in% c("Democrat", "D") ~ "Democrat",
party %in% c("Republican", "R") ~ "Republican",
party %in% c("Independent", "I") ~ "Independent"
))
congress_tidy %>%
count(pol_party)
```
---
### Part D
Now plot a scatterplot (same as Question 3) and set `color` to your party variable. Notice `R` uses its own default colors, which don't match to the actual colors these political parties use! Make a vector where you define the party colors as follows: `party_colors <- c("Democrat" = "blue", "Republican" = "red", "Independent" = "gray")`. Then, run your plot again, adding the following line to customize the colors `+scale_colour_manual("Parties", values = party_colors)`.^[`"Parties"` is the title that will show up on the legend, feel free to edit it, or remove the legend with another layer `+guides(color = F)`.]
---
```{r, fig.retina=3}
party_colors <- c("Democrat" = "blue", "Republican" = "red", "Independent" = "gray")
p<-ggplot(data = congress_tidy)+
aes(x = net_trump_vote, y=agree_pct)+
geom_point(aes(color = pol_party))+
geom_smooth(method="lm", color = "black")+
scale_colour_manual("Parties", values = party_colors)+
theme_light()
p
```
---
### Part E
Now facet your scatterplot by `chamber`.
---
```{r, fig.retina=3}
p+facet_wrap(~chamber)
```
```{r, fig.retina=3}
# just for kicks
p+facet_grid(chamber~pol_party)
```