3.4 — Multivariate OLS Estimators: Bias, Precision, and Fit — R Practice

Set Up

To minimize confusion, I suggest creating a new R Project and storing any data in that folder on your computer.

Alternatively, I have made a project in R Studio Cloud that you can use (and not worry about trading room computer limitations), with the data already inside (you will still need to assign it to an object).


Question 1

Download and read in (read_csv) the data below.

This data comes from a paper by Makowsky and Strattman (2009) that we will examine later. Even though state law sets a formula for tickets based on how fast a person was driving, police officers in practice often deviate from that formula. This dataset includes information on all traffic stops. An amount for the fine is given only for observations in which the police officer decided to assess a fine. There are a number of variables in this dataset, but the one’s we’ll look at are:

Variable Description
Amount Amount of fine (in dollars) assessed for speeding
Age Age of speeding driver (in years)
MPHover Miles per hour over the speed limit

We want to explore who gets fines, and how much. We’ll come back to the other variables (which are categorical) in this dataset in later lessons.

Question 2

How does the age of a driver affect the amount of the fine? Make a scatterplot of the Amount of the fine and the driver’s Age. Add a regression line with an additional layer of geom_smooth(method="lm").

Question 3

Find the correlation between Amount and Age.Note there are a lot of NAs (missing data) for Amount. You can verify by summary() or count() if you wish. In order to get a correlation, you will need to put use="pairwise.complete.obs" inside the cor() command.

Question 4

We weant to predict the following model:

\[\widehat{\text{Amount}_i}= \hat{\beta_0}+\hat{\beta_1}\text{Age}_i\]

Run a regression, and save it as an object. Now get a summary() of it.

Part A

Write out the estimated regression equation.

Part B

What is \(\hat{\beta_0}\)? What does it mean in the context of our question?

Part C

What is \(\hat{\beta_1}\)? What does it mean in the context of our question?

Part D

What is the marginal effect of age on amount?

Question 5

Redo question 4 with the broom package. Try out tidy() and glance(). This is just to keep you versatile.

Question 6

How big would the difference in expected fine be for two drivers, one 18 years old and one 40 years old?

Question 7

Now run the regression again, controlling for speed (MPHover).

Part A

Write the new regression equation.

Part B

What is the marginal effect of Age on Amount? What happened to it?

Part C

What is the marginal effect of MPHover on Amount?

Part D

What is \(\hat{\beta_0}\), and what does it mean?

Part E

What is the adjusted \(\bar{R}^2\)? What does it mean?

Question 8

Now suppose both the 18 year old and the 40 year old each went 10 MPH over the speed limit. How big would the difference in expected fine be for the two drivers?

Question 9

How about the difference in expected fine between two 18 year olds, one who went 10 MPH over, and one who went 30 MPH over?

Question 10

Use the huxtable package’s huxreg() command to make a regression table of your two regressions: the one from question 4, and the one from question 7.

Question 11

Are our two independent variables multicollinear? Do younger people tend to drive faster?

Part A

Get the correlation between Age and MPHover.

Part B

Make a scatterplot of MPHover on Age.

Part C

Run an auxiliary regression of MPHover on Age.

Part D

Interpret the coefficient on Age.

Part E

Look at your regression table in question 10. What happened to the standard error on Age? Why (consider the formula for variation in \(\hat{\beta_1})\)

Part F

Calculate the Variance Inflation Factor (VIF) using the car package’s vif() command.Run it on your multivariate regression object from Question 7.

Part G

Calculate the VIF manually, using what you learned in this question.

Question 12

Let’s now think about the omitted variable bias. Suppose the “true” model is the one we ran from Question 7.

Part A

Do you suppose that MPHover fits the two criteria for omitted variable bias?

Part B

Look at the regression we ran in Question 4. Consider this the “omitted” regression, where we left out MPHover. Does our estimate of the marginal effect of Age on Amount overstate or understate the true marginal effect?

Part C

Use the “true” model (Question 7), the “omitted” regression (Question 4), and our “auxiliary” regression (Question 11) to identify each of the following parameters that describe our biased estimate of the marginal effect of Age on Amount:See the notation I used in class 3.4.


Part D

From your answer in part C, how large is the omitted variable bias from leaving out MPHover?

Question 13

Make a coefficient plot of your coefficients from the regression in Question 7. The package modelsummary (which you will need to install and load) has a great command modelplot() to do this on your regression object.