3.7 — Regression with Interaction Effects — R Practice

Set Up

To minimize confusion, I suggest creating a new R Project (e.g. regression_practice) and storing any data in that folder on your computer.

Alternatively, I have made a project in R Studio Cloud that you can use (and not worry about trading room computer limitations), with the data already inside (you will still need to assign it to an object).

Question 1

We are returning to the speeding tickets data that we began to explore in R Practice 3.4 on Multivariate Regression. Download and read in (read_csv) the data below.

This data again comes from a paper by Makowsky and Strattman (2009). Even though state law sets a formula for tickets based on how fast a person was driving, police officers in practice often deviate from that formula. This dataset includes information on all traffic stops. An amount for the fine is given only for observations in which the police officer decided to assess a fine. There are a number of variables in this dataset, but the one’s we’ll look at are:

Variable Description
Amount Amount of fine (in dollars) assessed for speeding
Age Age of speeding driver (in years)
MPHover Miles per hour over the speed limit
Black Dummy $$=1$$ if driver was black, $$=0$$ if not
Hispanic Dummy $$=1$$ if driver was Hispanic, $$=0$$ if not
Female Dummy $$=1$$ if driver was female, $$=0$$ if not
OutTown Dummy $$=1$$ if driver was not from local town, $$=0$$ if not
OutState Dummy $$=1$$ if driver was not from local state, $$=0$$ if not
StatePol Dummy $$=1$$ if driver was stopped by State Police, $$=0$$ if stopped by other (local)

We again want to explore who gets fines, and how much.

Question 2

We will have to do a little bit of cleaning to get the data in a more usable form.

Part A

Inspect the data with str() or head() or glimpse() to see what it looks like.

Part B

What class of variable are Black, Hispanic, Female, OutTown, and OutState?Hint use the class(df\$variable) command to ask what class something is, where df is your dataframe, and variable is the name of a variable.

Part C

Notice that when importing the data from the .csv file, R interpretted these variables as integer, but we want them to be factor variables, to ensure R recognizes that there are two groups (categories), 0 and 1. Convert each of these variables into factors by reassigning it according to the format:

df<-df %>%
mutate(my_var=as.factor(my_var),
my_var2=as.factor(my_var2))

where

• df is the name of your dataframe
• my_var and my_var2 are example variablesAs a bonus, you can try doing this with just one conditional command: mutate_at(c("Black", "Hispanic", "Female", "OutTown", "OutState"),factor). See our Data Wrangling slides for refreshers of all the fancy mutate() possibilities!

Part D

Confirm they are each now factors by checking their class again.

Part E

Get a summary() of Amount. Note that there are a lot of NA’s (these are people that were stopped but did not recieve a fine)! filter() only those observations for which Amount is a positive number, and save this in your dataframe (assign and overwrite it, or make a new dataframe). Then double check that this worked by getting a summary() of Amount again.

Question 3

Create a scatterplot between Amount (as Y) and Female (as X).Hint: Use geom_jitter() instead of geom_point() to plot the points, and play around with width settings inside geom_jitter()

Question 4

Now let’s start looking at the distribution conditionally to find the different group means.

Part A

Find the mean and standard deviationHint: use the summarize() command, once you have properly filter()ed the data.

of Amount for male drivers and again for female drivers.

Part B

What is the difference between the average Amount for Males and Females?

Part C

We did not go over how to do this in class, but you can run a t-test for the difference in group means to see if the difference is statistically significant. The syntax is similar for a regression:

t.test(y~d, data=df)

where:

• y is the variable we are testing
• d is the dummy variable

Is there a statistically significant difference between Amount for male and female drivers?Hint: this is like any hypothesis test. A $$t$$-value needs to be large enough to be greater than a critical value of $$t$$. Check the $$p$$-value and see if it is less than our standard of $$\alpha=0.05.$$

Question 5

Now run the following regression to ensure we get the same result

$\text{Amount}_i=\hat{\beta_0}+\hat{\beta_1}Female_i$

Part A

Write out the estimated regression equation.

Part B

Use the regression coefficients to find

1. the average Amount for men
1. the average Amount for women
1. the difference in average Amount between men and women

Question 6

Let’s recode the sex variable.

Part A

Make a new variable called Male and save it in your dataframe.Hint: mutate() and define Male equal to factor(ifelse()). This makes the variable a factor (so we don’t have to later), and the ifelse() function has three arguments: 1. test = the condition to test, 2. yes = what to define “Male” as when the condition is TRUE, and 3. no = what to define “Male” as when the condition is FALSE.

Part B

Run the same regression as in question 5, but use Male instead of Female.

Part C

Write out the estimated regression equation.

Part D

Use the regression coefficients to find

1. the average Amount for men
1. the average Amount for women
1. the difference in average Amount between men and women

Question 7

Run a regression of Amount on Male and Female. What happens, and why?

Question 8

Age probably has a lot to do with differences in fines, perhaps also age affects fines differences between males and females.

Part A

Run a regression of Amount on Age and Female. How does the coefficient on Female change?

Part B

Now let’s see if the difference in fine between men and women are different depending on the driver’s age. Run the regression again, but add an interaction term between Female and Age interaction term.

Part C

Write out your estimated regression equation.

Part D

Interpret the interaction effect. Is it statistically significant?

Part E

Plugging in 0 or 1 as necessary, rewrite (on your paper) this regression as two separate equations, one for Males and one for Females.

Part F

Let’s try to visualize this. Make a scatterplot of Age (X) and Amount (Y) and include a regression line.

Part G

Try adding to your base layer aes(), set color=Female. This will produce two lines and color the points by Female.Sometimes we may also need to remind R that Female is a factor with as.factor(Female). We don’t need to in this case because we already reset Female as a faction in question 1.

Part H

Add a facet layer to make two different scatterplots with an additional layer +facet_wrap(~Female)

Question 9

Now let’s look at the possible interaction between Sex (Male or Female) and whether a driver is from In-State or Out-of-State (OutState).

Part A

Use R to examine the data and find the mean for (i) Males In-State, (ii) Males Out-of-State, (iii) Females In-State, and (iv) Females Out-of-State.

Part B

Now run a regression of the following model:

$\text{Amount}_i=\hat{\beta_0}+\hat{\beta_1}Female_i+\hat{\beta_2}OutState_i+\hat{\beta_3}Female_i*OutState_i$

Part C

Write out the estimated regression equation.

Part D

What does each coefficient mean?

Part E

Using the regression equation, what are the means for

1. Males In-State
1. Males Out-of-State
1. Females In-State
1. Females Out-of-State?

Collect your regressions from questions 5, 6b, 8a, 8b, and 9b and output them in a regression table with huxtable().