2.5 — OLS: Precision and Diagnostics - R Practice
Set Up
To minimize confusion, I suggest creating a new R Project (e.g. regression_practice) and storing any data in that folder on your computer.
Alternatively, I have made a project in RStudio Cloud that you can use (so you need not worry about trading room computer limitations), with the data already inside (you will still need to assign it to an object).
Question 1
Our data comes from fivethirtyeight’s Trump Congress tracker. Download the data and read it in with read_csv().
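As a minimal sketch of the read-in step: the three rows below are made up and written to a temporary file purely so the example runs on its own. With the real download, you would pass the path of your saved file to read_csv() instead. The object name congress is an assumption.

```r
library(readr)

# Hypothetical stand-in for the downloaded file: three made-up rows in a temporary CSV
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "last_name,agree_pct,net_trump_vote",
  "Alpha,0.95,25.3",
  "Bravo,0.20,-31.0",
  "Charlie,0.60,4.8"
), csv_path)

# read_csv() returns a tibble; assign it to an object so later steps can use it
congress <- read_csv(csv_path)
congress
```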
Question 2
Look at the data with glimpse().
Question 3
We want to see how the percentage of the time that a member of Congress agrees with President Trump (agree_pct) depends on the result of the 2016 Presidential election in their district (net_trump_vote). First, make a scatterplot of agree_pct on net_trump_vote. Add a regression line with an additional layer of geom_smooth(method="lm").
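The plot described above might be sketched as follows. The data here are simulated (the slope, noise level, and object name sim are assumptions, not values from the tracker), so only the ggplot2 syntax carries over to the real data.

```r
library(ggplot2)

# Simulated stand-in data (assumed values, not the real tracker)
set.seed(42)
sim <- data.frame(net_trump_vote = runif(200, min = -60, max = 60))
sim$agree_pct <- 0.5 + 0.007 * sim$net_trump_vote + rnorm(200, sd = 0.1)

# Scatterplot of agree_pct (y) on net_trump_vote (x) with an OLS line on top
p <- ggplot(sim, aes(x = net_trump_vote, y = agree_pct)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm")
p
```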
Question 4
Find the correlation between agree_pct and net_trump_vote.
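One possible way to get the correlation is with summarize() and cor(). The data below are simulated stand-ins (assumed values), so the number you get from the real data will differ.

```r
library(dplyr)

# Simulated stand-in data (assumed values, not the real tracker)
set.seed(42)
sim <- tibble(
  net_trump_vote = runif(200, min = -60, max = 60),
  agree_pct = 0.5 + 0.007 * net_trump_vote + rnorm(200, sd = 0.1)
)

# cor() computes the Pearson correlation between the two columns
cor_tbl <- sim %>%
  summarize(correlation = cor(agree_pct, net_trump_vote))
cor_tbl
```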
Question 5
We want to estimate the following model:
\[\widehat{\text{agree_pct}}= \hat{\beta_0}+\hat{\beta_1}\text{net_trump_vote}\]
Run a regression, and save it as an object. Now get a summary() of it.
Part A
What is \(\hat{\beta_0}\)? What does it mean in the context of our question?
Part B
What is \(\hat{\beta_1}\)? What does it mean in the context of our question?
Part C
What is \(R^2\)? What does it mean?
Part D
What is the \(SER\)? What does it mean?
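A minimal sketch of the regression and of where each quantity in Parts A to D lives in the output, using simulated stand-in data (assumed values; the object names sim, reg, and reg_summary are also assumptions). In the summary() printout, the SER is reported as the "Residual standard error".

```r
# Simulated stand-in data (assumed values, not the real tracker)
set.seed(42)
sim <- data.frame(net_trump_vote = runif(200, min = -60, max = 60))
sim$agree_pct <- 0.5 + 0.007 * sim$net_trump_vote + rnorm(200, sd = 0.1)

# Estimate agree_pct = beta_0 + beta_1 * net_trump_vote by OLS and save the result
reg <- lm(agree_pct ~ net_trump_vote, data = sim)
reg_summary <- summary(reg)
reg_summary

coef(reg)               # estimated intercept (beta_0 hat) and slope (beta_1 hat)
reg_summary$r.squared   # R^2
reg_summary$sigma       # SER (residual standard error)
```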
Question 6
We can look at regression outputs in a tidier way, with the broom package.
Part A
Install and then load broom.
Part B
Run the function tidy() on your regression object (saved in Question 5). Save this as an object and then look at it.
Part C
Run the glance() function on your original regression object. What does it show you?
Part D
Now run the augment() function on your original regression object, and save this as an object. Look at it. What does it show you?
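The three broom functions from Parts B to D can be sketched together. The data are simulated stand-ins (assumed values), and the object names are assumptions.

```r
library(broom)

# Simulated stand-in data (assumed values, not the real tracker)
set.seed(42)
sim <- data.frame(net_trump_vote = runif(200, min = -60, max = 60))
sim$agree_pct <- 0.5 + 0.007 * sim$net_trump_vote + rnorm(200, sd = 0.1)
reg <- lm(agree_pct ~ net_trump_vote, data = sim)

reg_tidy   <- tidy(reg)     # one row per coefficient: estimate, std.error, statistic, p.value
reg_glance <- glance(reg)   # a single row of model-level statistics (r.squared, sigma, ...)
reg_aug    <- augment(reg)  # one row per observation, adding .fitted, .resid, and more

reg_tidy
reg_glance
head(reg_aug)
```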
Question 7
Now let’s start looking at the residuals of the regression.
Part A
Take the augmented regression object from Question 6-D and use it as the source of your data to create a histogram, where \(x\) is .resid. Does it look roughly normal?
Part B
Take the augmented regression object and make a residual plot, which is a scatterplot where x is the original x variable (net_trump_vote) and y is .resid. Feel free to add a horizontal line at 0 with geom_hline(yintercept=0).
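Both residual plots in this question might look like the sketch below. It rebuilds an augmented object from simulated stand-in data (assumed values) so that it runs on its own. With your own work you would reuse the object saved in Question 6-D.

```r
library(broom)
library(ggplot2)

# Simulated stand-in data (assumed values, not the real tracker)
set.seed(42)
sim <- data.frame(net_trump_vote = runif(200, min = -60, max = 60))
sim$agree_pct <- 0.5 + 0.007 * sim$net_trump_vote + rnorm(200, sd = 0.1)
reg_aug <- augment(lm(agree_pct ~ net_trump_vote, data = sim))

# Part A: histogram of the residuals
resid_hist <- ggplot(reg_aug, aes(x = .resid)) +
  geom_histogram(bins = 30, color = "white")

# Part B: residuals against the regressor, with a reference line at zero
resid_plot <- ggplot(reg_aug, aes(x = net_trump_vote, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red")

resid_hist
resid_plot
```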
Question 8
Now let’s try presenting your results in a regression table. Install and load huxtable, and run the huxreg() command. Your main input is the regression object you saved in Question 5. Feel free to customize the output of this table (see the slides).
Question 9
Now let’s check for heteroskedasticity.
Part A
Looking at the scatterplot and residual plots in Questions 3 and 7B, do you think the errors are heteroskedastic or homoskedastic?
Part B
Install and load the lmtest package and run bptest on your regression object. According to the test, are the errors heteroskedastic or homoskedastic?
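As a hedged illustration of bptest(), the simulated data below are deliberately built so that the error spread grows with net_trump_vote (an assumption made for the example, not a claim about the real data). The null hypothesis of the Breusch-Pagan test is homoskedasticity, so a small p-value is evidence of heteroskedasticity.

```r
library(lmtest)

# Simulated stand-in data whose error spread increases with x (assumed values)
set.seed(42)
sim <- data.frame(net_trump_vote = runif(200, min = -60, max = 60))
sim$agree_pct <- 0.5 + 0.005 * sim$net_trump_vote +
  rnorm(200, sd = 0.02 + 0.002 * (sim$net_trump_vote + 60))
reg <- lm(agree_pct ~ net_trump_vote, data = sim)

# Breusch-Pagan test; H0: the errors are homoskedastic
bp <- bptest(reg)
bp
bp$p.value
```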
Part C
Now let’s make some heteroskedasticity-robust standard errors. Install and load the estimatr package and use the lm_robust() command (instead of the lm() command) to run a new regression (and save it). Make sure you add se_type="stata" inside the command to calculate robust SEs. Look at it. What changes?
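A minimal sketch of the robust regression next to the ordinary one, again on simulated stand-in data with non-constant error spread (assumed values). Robust standard errors leave the coefficient estimates unchanged. Only the standard errors, and anything computed from them, can differ.

```r
library(estimatr)

# Simulated stand-in data whose error spread increases with x (assumed values)
set.seed(42)
sim <- data.frame(net_trump_vote = runif(200, min = -60, max = 60))
sim$agree_pct <- 0.5 + 0.005 * sim$net_trump_vote +
  rnorm(200, sd = 0.02 + 0.002 * (sim$net_trump_vote + 60))

reg        <- lm(agree_pct ~ net_trump_vote, data = sim)
reg_robust <- lm_robust(agree_pct ~ net_trump_vote, data = sim, se_type = "stata")

summary(reg_robust)

# Same point estimates, different standard errors
coef(reg)
reg_robust$coefficients
reg_robust$std.error
```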
Part D
Now let’s see this in a nice regression table. Use huxreg() again, but add both your original regression and your regression saved in Part C. Notice any changes?
Question 10
Now let’s check for outliers.
Part A
Just looking at the scatterplot in Question 3, do you see any outliers?
Part B
Install and load the car package. Run the outlierTest command on your regression object. Does it detect any outliers?
Part C
Look in your original data to match this outlier with an observation. Hint: use the slice() command, as the outlier test gave you an observation (row) number!
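The outlier check and the row lookup might be sketched as follows. The data are simulated stand-ins, and one unusual observation is planted in row 17 purely for illustration (both are assumptions). The row numbers reported by outlierTest() appear as the names of its rstudent component.

```r
library(car)
library(dplyr)

# Simulated stand-in data (assumed values, not the real tracker)
set.seed(42)
sim <- data.frame(net_trump_vote = runif(200, min = -60, max = 60))
sim$agree_pct <- 0.5 + 0.007 * sim$net_trump_vote + rnorm(200, sd = 0.1)

# Plant one unusual observation in row 17 for the test to find
sim$net_trump_vote[17] <- 40
sim$agree_pct[17] <- 0.05

reg <- lm(agree_pct ~ net_trump_vote, data = sim)

# Bonferroni-adjusted test on the studentized residuals
out_test <- outlierTest(reg)
out_test

# Pull the flagged row number(s) and look them up in the original data
flagged_rows <- as.integer(names(out_test$rstudent))
sim %>% slice(flagged_rows)
```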
Question 11 (Optional)
This data is still a bit messy. Let’s check in on your tidyverse skills again! For example, we’d probably like to color the points in our scatterplots by party (Republican or Democratic), or to plot the House and the Senate separately.
Part A
First, the variable congress (session of Congress) seems a bit off. Get a count() of congress.
Part B
Let’s get rid of the 0 values for congress (someone made a mistake coding this, probably). Also, while we’re at it, let’s take agree_pct and mutate a variable that is a proper percentage (i.e. *100).
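Parts A and B can be sketched on a tiny made-up table. The congress values and the new column name agree_pct_100 are assumptions chosen for illustration.

```r
library(dplyr)

# Made-up rows standing in for the real data
members <- tibble(
  congress  = c(0, 115, 115, 116),
  agree_pct = c(0.50, 0.90, 0.25, 0.70)
)

# Part A: how many rows fall under each value of congress?
members %>% count(congress)

# Part B: drop the miscoded 0 rows, then rescale the proportion to a 0-100 percentage
members_clean <- members %>%
  filter(congress != 0) %>%
  mutate(agree_pct_100 = agree_pct * 100)
members_clean
```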
Part C
The variable party is also quite a mess. count() by party to see. Then let’s mutate a variable to make a better measure of political party - just "Republican", "Democrat", and "Independent". Try doing this with the case_when() command (as your mutate formula). The syntax for case_when() is a series of condition ~ "Outcome" pairs, separated by commas. For example, one condition is to assign both "Democrat" and "D" to "Democrat", as in party %in% c("Democrat", "D") ~ "Democrat". You could also do this with a few ifelse() commands, but that’s a bit more awkward.
When you’re done, count() by your new party variable to make sure it worked.
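A sketch of the case_when() recoding on a made-up column. Only "Democrat" and "D" are named in the text above. The other raw labels ("R", "Republican", "I", "Independent") and the new column name party_clean are assumptions, so check them against your own count() output.

```r
library(dplyr)

# Made-up messy labels; only "Democrat" and "D" are confirmed by the exercise text
members <- tibble(
  party = c("D", "Democrat", "R", "Republican", "I", "Independent")
)

members <- members %>%
  mutate(
    party_clean = case_when(
      party %in% c("Democrat", "D")    ~ "Democrat",
      party %in% c("Republican", "R")  ~ "Republican",
      party %in% c("Independent", "I") ~ "Independent"
    )
  )

# Verify: each cleaned label should absorb its raw variants
party_counts <- members %>% count(party_clean)
party_counts
```

Any raw label not matched by one of the conditions becomes NA, which makes leftover miscodings easy to spot in the count.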
Part D
Now plot a scatterplot (same as Question 3) and set color to your party variable. Notice R uses its own default colors, which don’t match the colors these political parties actually use! Make a vector where you define the party colors as follows: party_colors <- c("Democrat" = "blue", "Republican" = "red", "Independent" = "gray"). Then, run your plot again, adding the following line to customize the colors: +scale_colour_manual("Parties", values = party_colors). "Parties" is the title that will show up on the legend; feel free to edit it, or remove the legend with another layer: +guides(color = "none").
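The colored scatterplot might be put together as below. The data and the party assignment are simulated (assumed values, with party_clean as a hypothetical name for the cleaned party variable). The party_colors vector and the scale_colour_manual() call are the ones given in the text.

```r
library(ggplot2)

# Simulated stand-in data with a made-up party assignment (assumed values)
set.seed(42)
sim <- data.frame(net_trump_vote = runif(150, min = -60, max = 60))
sim$agree_pct <- 0.5 + 0.007 * sim$net_trump_vote + rnorm(150, sd = 0.1)
sim$party_clean <- ifelse(sim$net_trump_vote > 0, "Republican", "Democrat")
sim$party_clean[1:10] <- "Independent"

# Named vector: names must match the values of the party variable exactly
party_colors <- c("Democrat" = "blue", "Republican" = "red", "Independent" = "gray")

p <- ggplot(sim, aes(x = net_trump_vote, y = agree_pct, color = party_clean)) +
  geom_point() +
  geom_smooth(method = "lm") +
  scale_colour_manual("Parties", values = party_colors)
p
```

Because color is set in the top-level aes(), geom_smooth() fits a separate line for each party. Moving color into geom_point() alone would give a single overall line.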
Part E
Now facet your scatterplot by chamber.
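Faceting adds one more layer, sketched here on simulated stand-in data. The chamber labels "house" and "senate" are assumptions, so check the actual values in your data.

```r
library(ggplot2)

# Simulated stand-in data with a made-up chamber column (assumed values)
set.seed(42)
sim <- data.frame(net_trump_vote = runif(200, min = -60, max = 60))
sim$agree_pct <- 0.5 + 0.007 * sim$net_trump_vote + rnorm(200, sd = 0.1)
sim$chamber <- rep(c("house", "senate"), times = c(160, 40))

# facet_wrap() draws one panel per value of chamber
p_facet <- ggplot(sim, aes(x = net_trump_vote, y = agree_pct)) +
  geom_point() +
  geom_smooth(method = "lm") +
  facet_wrap(~chamber)
p_facet
```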