---
title: "2.5 — OLS: Precision and Diagnostics — R Practice"
author: "YOUR NAME"
date: "`r Sys.Date()`"
output:
  html_document:
    df_print: paged
    #theme:
    toc: true
    toc_depth: 3
    toc_float: true
    code_folding: show
    highlight: tango
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Set Up
To minimize confusion, I suggest creating a new `R Project` (e.g. `regression_practice`) and storing any data in that folder on your computer.
Alternatively, I have made a project in RStudio Cloud that you can use, with the data already inside, so you need not worry about the limitations of the trading room computers. You will still need to assign the data to an object.
- [View Project on RStudio Cloud](https://rstudio.cloud/spaces/83147/project/1637755)
## Question 1
Our [data](https://github.com/fivethirtyeight/data/tree/master/congress-trump-score) comes from FiveThirtyEight's [Trump Congress tracker](https://projects.fivethirtyeight.com/congress-trump-score/). Download the data and read it in with `read_csv()`.
- [`congress-trump-score.csv`](/data/congress-trump-score.csv)
---
```{r}
# PUT CODE HERE
```
---
## Question 2
Look at the data with `glimpse()`.
---
```{r}
# PUT CODE HERE
```
---
## Question 3
We want to see *how the percentage of the time that a member of Congress agrees with President Trump (`agree_pct`) depends on the result of the 2016 Presidential election in their district (`net_trump_vote`)*. First, make a scatterplot of `agree_pct` on `net_trump_vote`. Add a regression line with an additional layer of `geom_smooth(method="lm")`.
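If the layering is unfamiliar, here is a minimal sketch of the same kind of plot. It uses R's built-in `mtcars` data (an assumption for illustration only, not the assignment data), so you will need to swap in your own data frame and variables.

```r
library(ggplot2)

# Illustration only: built-in mtcars, not the assignment data
p <- ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point() +                # scatterplot layer
  geom_smooth(method = "lm")    # fitted OLS line with a confidence band

p
```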
---
```{r}
# PUT CODE HERE
```
---
## Question 4
Find the correlation between `agree_pct` and `net_trump_vote`.
---
```{r}
# PUT CODE HERE
```
---
## Question 5
We want to estimate the following model:
$$\widehat{\text{agree_pct}}= \hat{\beta_0}+\hat{\beta_1}\text{net_trump_vote}$$
Run a regression, and save it as an object. Now get a `summary()` of it.
---
```{r}
# PUT CODE HERE
```
---
### Part A
What is $\hat{\beta_0}$? What does it mean in the context of our question?
---
```{r}
# PUT CODE HERE
```
---
### Part B
What is $\hat{\beta_1}$? What does it mean in the context of our question?
---
```{r}
# PUT CODE HERE
```
---
### Part C
What is $R^2$? What does it mean?
---
```{r}
# PUT CODE HERE
```
---
### Part D
What is the $SER$? What does it mean?
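As a reminder of where this number comes from, the sketch below computes the standard error of the regression by hand, assuming the usual bivariate definition $SER = \sqrt{\frac{\sum \hat{u}_i^2}{n-2}}$, and compares it with what `summary()` reports. It uses the built-in `mtcars` data and a hypothetical object name `reg_demo`, not the assignment data.

```r
# Illustration only: built-in mtcars, not the assignment data
reg_demo <- lm(mpg ~ wt, data = mtcars)

n            <- nobs(reg_demo)               # number of observations
sum_sq_resid <- sum(residuals(reg_demo)^2)   # sum of squared residuals
ser          <- sqrt(sum_sq_resid / (n - 2)) # two estimated parameters, so n - 2 degrees of freedom

ser
summary(reg_demo)$sigma  # R labels this "Residual standard error"
```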
---
```{r}
# PUT CODE HERE
```
---
## Question 6
We can look at regression outputs in a tidier way, with the `broom` package.
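As a preview, this sketch runs the three main `broom` functions on a regression of the built-in `mtcars` data (an illustration only; `reg_demo` is a made-up name). Each function returns a tibble at a different level: coefficients, the model as a whole, and individual observations.

```r
library(broom)

# Illustration only: built-in mtcars, not the assignment data
reg_demo <- lm(mpg ~ wt, data = mtcars)

tidy(reg_demo)     # one row per coefficient: estimate, std.error, statistic, p.value
glance(reg_demo)   # one row for the model: r.squared, sigma, and other summary statistics
augment(reg_demo)  # one row per observation: original variables plus .fitted, .resid, ...
```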
### Part A
Install and then load `broom`.
---
```{r}
# PUT CODE HERE
```
---
### Part B
Run the function `tidy()` on your regression object (saved in Question 5). Save this as an object and then look at it.
---
```{r}
# PUT CODE HERE
```
---
### Part C
Run the `glance()` function on your original regression object. What does it show you?
---
```{r}
# PUT CODE HERE
```
---
### Part D
Now run the `augment()` function on your original regression object, and save this as an object. Look at it. What does it show you?
---
```{r}
# PUT CODE HERE
```
---
## Question 7
Now let's start looking at the residuals of the regression.
### Part A
Take the augmented regression object from Question 6-D and use it as the source of your data to create a histogram, where $x$ is `.resid`. Does it look roughly normal?
---
```{r}
# PUT CODE HERE
```
---
### Part B
Take the augmented regression object and make a residual plot, which is a scatterplot where `x` is the original explanatory variable (`net_trump_vote`) and `y` is `.resid`. Feel free to add a horizontal line at 0 with `geom_hline(yintercept=0)`.
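A minimal sketch of a residual plot, again on the built-in `mtcars` data rather than the assignment data (the object names and the red line color are my own choices):

```r
library(broom)
library(ggplot2)

# Illustration only: built-in mtcars, not the assignment data
reg_demo <- lm(mpg ~ wt, data = mtcars)
aug_demo <- augment(reg_demo)   # adds .fitted and .resid to the data

resid_plot <- ggplot(data = aug_demo, aes(x = wt, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red")  # reference line at a residual of 0

resid_plot
```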
---
```{r}
# PUT CODE HERE
```
---
## Question 8
Now let's try presenting your results in a regression table. Install and load `huxtable`, and run the `huxreg()` command. Your main input is your regression object you saved in Question 5. Feel free to customize the output of this table (see the slides).
---
```{r}
# PUT CODE HERE
```
---
## Question 9
Now let's check for heteroskedasticity.
### Part A
Looking at the scatterplot and residual plots in Questions 3 and 7B, do you think the errors are heteroskedastic or homoskedastic?
---
```{r}
# PUT CODE HERE
```
---
### Part B
Install and load the `lmtest` package and run `bptest()` on your regression object. According to the test, are the errors heteroskedastic or homoskedastic?
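To help read the output: the null hypothesis of the Breusch-Pagan test is that the errors are homoskedastic, so a small p-value is evidence of heteroskedasticity. A sketch on the built-in `mtcars` data (illustration only; `bp_demo` is a made-up name):

```r
library(lmtest)

# Illustration only: built-in mtcars, not the assignment data
reg_demo <- lm(mpg ~ wt, data = mtcars)

bp_demo <- bptest(reg_demo)  # H0: errors are homoskedastic
bp_demo
bp_demo$p.value              # a small p-value means we reject homoskedasticity
```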
---
```{r}
# PUT CODE HERE
```
---
### Part C
Now let's make some heteroskedasticity-robust standard errors. Install and load the `estimatr` package and use the `lm_robust()` command (instead of the `lm()` command) to run a new regression (and save it). Make sure you add `se_type="stata"` inside the command to calculate robust SEs. Look at it. What changes?
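As a hint for what to look for, the sketch below fits the same model both ways on the built-in `mtcars` data (illustration only; the object names are made up). The coefficient estimates should be identical, since only the standard errors, and everything computed from them, are calculated differently.

```r
library(estimatr)

# Illustration only: built-in mtcars, not the assignment data
reg_demo        <- lm(mpg ~ wt, data = mtcars)
reg_demo_robust <- lm_robust(mpg ~ wt, data = mtcars, se_type = "stata")

# Same point estimates in both models
coef(reg_demo)
coef(reg_demo_robust)

# Different standard errors
summary(reg_demo)$coefficients[, "Std. Error"]  # classical (homoskedastic) SEs
reg_demo_robust$std.error                       # heteroskedasticity-robust SEs
```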
---
```{r}
# PUT CODE HERE
```
---
### Part D
Now let's see this in a nice regression table. Use `huxreg()` again, but add both your original regression and your regression saved in part C. Notice any changes?
---
```{r}
# PUT CODE HERE
```
---
## Question 10
Now let's check for outliers.
### Part A
Just looking at the scatterplot in Question 3, do you see any outliers?
---
```{r}
# PUT CODE HERE
```
---
### Part B
Install and load the `car` package. Run the `outlierTest` command on your regression object. Does it detect any outliers?
---
```{r}
# PUT CODE HERE
```
---
### Part C
Look in your original data to match this outlier with an observation. Hint: use the `slice()` command, as the outlier test gave you an observation (row) number!
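Here is a sketch of the workflow on hypothetical toy data, where I deliberately push one observation (row 17) far off the line. It assumes the test labels observations by row number, which is the case when the data have no custom row names.

```r
library(car)
library(dplyr)

# Hypothetical toy data with one observation deliberately pushed far off the line
toy <- tibble(x = 1:30, y = 2 + 0.5 * x + sin(x))
toy$y[17] <- toy$y[17] + 20

toy_reg <- lm(y ~ x, data = toy)

out <- outlierTest(toy_reg)  # Bonferroni test on the largest studentized residuals
out

# The flagged observations are labelled by row number; use that to pull the row
flagged_row <- as.integer(names(out$rstudent))
toy %>% slice(flagged_row)
```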
---
```{r}
# PUT CODE HERE
```
---
## Question 11 (Optional)
This data is still a bit messy. Let's check in on your `tidyverse` skills again! For example, we'd probably like to color the points in our scatterplots by party (Republican or Democratic), or plot the House and the Senate separately.
### Part A
First, the variable `congress` (session of Congress) seems a bit off. Get a `count()` of `congress`.
---
```{r}
# PUT CODE HERE
```
---
### Part B
Let's get rid of the `0` values for `congress` (someone made a mistake coding this, probably). Also, while we're at it, let's take `agree_pct` and `mutate` a variable that is a proper percentage (i.e. `*100`).
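A minimal sketch of the same two steps on hypothetical toy data (the column names `session` and `share` are made up so as not to give away the answer):

```r
library(dplyr)

# Hypothetical toy data; column names are made up for illustration
toy <- tibble(session = c(0, 115, 115, 116),
              share   = c(0.50, 0.25, 0.90, 0.75))

toy_clean <- toy %>%
  filter(session != 0) %>%         # drop the miscoded rows
  mutate(share_pct = share * 100)  # proportion -> percentage

toy_clean
```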
---
```{r}
# PUT CODE HERE
```
---
### Part C
The variable `party` is also quite a mess. `count()` by `party` to see. Then let's `mutate` a variable to make a better measure of political party: just `"Republican"`, `"Democrat"`, and `"Independent"`. Try doing this with the `case_when()` command (as your `mutate` formula).^[The syntax for `case_when()` is to have a series of `condition ~ "Outcome"`, separated by commas. For example, one condition is to assign both `"Democrat"` and `"D"` to `"Democrat"`, as in `party %in% c("Democrat", "D") ~ "Democrat"`. You could also do this with a few `ifelse()` commands, but that's a bit more awkward.] When you're done, `count()` by your new party variable to make sure it worked.
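To see the `case_when()` pattern in action without spoiling the exercise, here is a sketch on hypothetical toy data about fruit. Conditions are checked in order, and the final `TRUE ~ ...` line acts as a catch-all for anything not matched above it.

```r
library(dplyr)

# Hypothetical toy data with inconsistent labels
toy <- tibble(fruit_raw = c("apple", "Apple", "A", "banana", "B", "cherry"))

toy_clean <- toy %>%
  mutate(fruit = case_when(
    fruit_raw %in% c("apple", "Apple", "A") ~ "Apple",
    fruit_raw %in% c("banana", "B")         ~ "Banana",
    TRUE                                    ~ "Other"  # catch-all for everything else
  ))

toy_clean %>% count(fruit)  # check that the recoding worked
```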
---
```{r}
# PUT CODE HERE
```
---
### Part D
Now plot a scatterplot (same as Question 3) and set `color` to your party variable. Notice `R` uses its own default colors, which don't match the colors these political parties actually use! Make a vector where you define the party colors as follows: `party_colors <- c("Democrat" = "blue", "Republican" = "red", "Independent" = "gray")`. Then, run your plot again, adding the following layer to customize the colors: `+scale_colour_manual("Parties", values = party_colors)`.^[`"Parties"` is the title that will show up on the legend. Feel free to edit it, or remove the legend with another layer, `+guides(color = "none")` (older versions of `ggplot2` also accepted `color = F`, which is now deprecated).]
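A sketch of how the named color vector gets used, on a hypothetical six-row toy data frame (the `x` and `y` values are made up). Because the vector is named, colors are matched to the values of `party` by name rather than by order.

```r
library(ggplot2)

# Hypothetical toy data for illustration only
toy <- data.frame(
  x     = c(1, 2, 3, 4, 5, 6),
  y     = c(2, 1, 4, 3, 6, 5),
  party = c("Democrat", "Republican", "Independent",
            "Democrat", "Republican", "Independent")
)

party_colors <- c("Democrat" = "blue", "Republican" = "red", "Independent" = "gray")

p <- ggplot(data = toy, aes(x = x, y = y, color = party)) +
  geom_point() +
  scale_colour_manual("Parties", values = party_colors)  # legend title, then colors

p
```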
---
```{r}
# PUT CODE HERE
```
---
### Part E
Now facet your scatterplot by `chamber`.
---
```{r}
# PUT CODE HERE
```
---