--- title: "3.7 — Regression with Interaction Effects - R Practice" author: "YOUR NAME" date: "`r Sys.Date()`" output: html_document: df_print: paged #theme: toc: true toc_depth: 3 toc_float: true code_folding: show highlight: tango --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` ## Question 1 We are returning to the speeding tickets data that we began to explore in [R Practice 3.4 on Multivariate Regression](http://metricsf20.classes.ryansafner.com/r/3.4-r-practice). Download and read in (`read_csv`) the `speeding_tickets.csv` data again. - [ `speeding_tickets.csv`](https://metricsf20.classes.ryansafner.com/data/speeding_tickets.csv) This data again comes from a paper by Makowsky and Strattman (2009). Even though state law sets a formula for tickets based on how fast a person was driving, police officers in practice often deviate from that formula. This dataset includes information on all traffic stops. An amount for the fine is given only for observations in which the police officer decided to assess a fine. There are a number of variables in this dataset, but the one's we'll look at are: | Variable | Description | |----------|-------------| | `Amount` | Amount of fine (in dollars) assessed for speeding | | `Age` | Age of speeding driver (in years) | | `MPHover` | Miles per hour over the speed limit | | `Black` | Dummy $=1$ if driver was black, $=0$ if not | | `Hispanic` | Dummy $=1$ if driver was Hispanic, $=0$ if not | | `Female` | Dummy $=1$ if driver was female, $=0$ if not | | `OutTown` | Dummy $=1$ if driver was not from local town, $=0$ if not | | `OutState` | Dummy $=1$ if driver was not from local state, $=0$ if not | | `StatePol` | Dummy $=1$ if driver was stopped by State Police, $=0$ if stopped by other (local) | > We again want to explore who gets fines, and how much. --- ```{r} # PUT CODE HERE ``` --- ## Question 2 We will have to do a little bit of cleaning to get the data in a more usable form. ### Part A Inspect the data with `str()` or `head()` or `glimpse()` to see what it looks like. --- ```{r} # PUT CODE HERE ``` --- ### Part B What `class` of variable are `Black`, `Hispanic`, `Female`, `OutTown`, and `OutState`?^[Hint use the `class(df$variable)` command to ask what class something is, where `df` is your dataframe, and `variable` is the name of a variable.] --- ```{r} # PUT CODE HERE ``` --- ### Part C Notice that when importing the data from the `.csv` file, `R` interpretted these variables as `integer`, but we want them to be `factor` variables, to ensure `R` recognizes that there are two groups (categories), 0 and 1. Convert each of these variables into factors by reassigning it according to the format: ```{r, echo = T, eval = F} df<-df %>% mutate(my_var=as.factor(my_var), my_var2=as.factor(my_var2)) ``` where - `df` is the name of your dataframe - `my_var` and `my_var2` are example variables^[As a bonus, you can try doing this with just one conditional command: `mutate_at(c("Black", "Hispanic", "Female", "OutTown", "OutState"),factor)`. See our [Data Wrangling slides](https://metricsf19.classes.ryansafner.com/slides/05-slides#95) for refreshers of all the fancy `mutate()` possibilities!] --- ```{r} # PUT CODE HERE ``` --- ### Part D Confirm they are each now factors by checking their class again. --- ```{r} # PUT CODE HERE ``` --- ### Part E Get a `summary()` of `Amount`. Note that there are a lot of `NA`'s (these are people that were stopped but did not recieve a fine)! `filter()` only those observations for which `Amount` is a positive number, and save this in your dataframe (assign and overwrite it, or make a new dataframe). Then double check that this worked by getting a `summary()` of `Amount` again. ```{r} # PUT CODE HERE ``` --- ## Question 3 Create a scatterplot between `Amount` (as Y) and Female (as X).^[Hint: Use `geom_jitter()` instead of `geom_point()` to plot the points, and play around with `width` settings inside `geom_jitter()`] --- ```{r} # PUT CODE HERE ``` --- ## Question 4 Now let's start looking at the distribution conditionally to find the different group means. ### Part A Find the mean and standard deviation^[Hint: use the `summarize()` command, once you have properly `filter()`ed the data.] of `Amount` for *male* drivers and again for *female* drivers. --- ```{r} # PUT CODE HERE ``` --- ### Part B What is the difference between the average Amount for Males and Females? --- ```{r} # PUT CODE HERE ``` --- ### Part C We did not go over how to do this in class, but you can run a **t-test for the difference in group means** to see if the difference is statistically significant. The syntax is similar for a regression: `t.test(y~d, data=df)` where: - `y` is the variable we are testing - `d` is the dummy variable Is there a statistically significant difference between Amount for male and female drivers?^[Hint: this is like any hypothesis test. A $t$-value needs to be large enough to be greater than a critical value of $t$. Check the $p$-value and see if it is less than our standard of $\alpha=0.05.$] --- ```{r} # PUT CODE HERE ``` --- ## Question 5 Now run the following regression to ensure we get the same result $$\text{Amount}_i=\hat{\beta_0}+\hat{\beta_1}Female_i$$ --- ```{r} # PUT CODE HERE ``` --- ### Part A Write out the estimated regression equation. --- ```{r} # PUT CODE HERE ``` --- ### Part B Use the regression coefficients to find - (i) the average `Amount` for men - (ii) the average `Amount` for women - (iii) the difference in average `Amount` between men and women --- ```{r} # PUT CODE HERE ``` --- ## Question 6 Let's recode the sex variable. ### Part A Make a new variable called `Male` and save it in your dataframe.^[Hint: `mutate()` and define `Male` equal to `factor(ifelse())`. This makes the variable a `factor` (so we don't have to later), and the `ifelse()` function has three arguments: 1. `test =` the condition to test, 2. `yes =` what to define "Male" as when the condition is `TRUE`, and 3. `no = ` what to define "Male" as when the condition is `FALSE`.] --- ```{r} # PUT CODE HERE ``` --- ### Part B Run the same regression as in question 5, but use `Male` instead of `Female`. --- ```{r} # PUT CODE HERE ``` --- ### Part C Write out the estimated regression equation. --- ```{r} # PUT CODE HERE ``` --- ### Part D Use the regression coefficients to find - (i) the average `Amount` for men - (ii) the average `Amount` for women - (iii) the difference in average `Amount` between men and women --- ```{r} # PUT CODE HERE ``` --- ## Question 7 Run a regression of `Amount` on `Male` and `Female`. What happens, and why? --- ```{r} # PUT CODE HERE ``` --- ## Question 8 `Age` probably has a lot to do with differences in fines, perhaps also age affects fines differences between males and females. ### Part A Run a regression of `Amount` on `Age` and `Female.` How does the coefficient on `Female` change? --- ```{r} # PUT CODE HERE ``` --- ### Part B Now let's see if the difference in fine between men and women are different depending on the driver's age. Run the regression again, but add an **interaction term** between `Female` and `Age` interaction term. --- ```{r} # PUT CODE HERE ``` --- ### Part C Write out your estimated regression equation. --- ```{r} # PUT CODE HERE ``` --- ### Part D Interpret the interaction effect. Is it statistically significant? --- ```{r} # PUT CODE HERE ``` --- ### Part E Plugging in 0 or 1 as necessary, rewrite (on your paper) this regression as *two separate* equations, one for Males and one for Females. --- ```{r} # PUT CODE HERE ``` --- ### Part F Let's try to visualize this. Make a scatterplot of `Age` (X) and `Amount` (Y) and include a regression line. --- ```{r} # PUT CODE HERE ``` --- ### Part G Try adding to your base layer `aes()`, set `color=Female`. This will produce two lines and color the points by `Female`.^[Sometimes we may also need to remind `R` that `Female` is a factor with `as.factor(Female)`. We don't need to in *this* case because we already reset `Female` as a faction in question 1.] --- ```{r} # PUT CODE HERE ``` --- ### Part H Add a facet layer to make two different scatterplots with an additional layer `+facet_wrap(~Female)` --- ```{r} # PUT CODE HERE ``` --- ## Question 9 Now let's look at the possible interaction between Sex (`Male` or `Female`) and whether a driver is from In-State or Out-of-State (`OutState`). ### Part A Use `R` to examine the data and find the mean for (i) Males In-State, (ii) Males Out-of-State, (iii) Females In-State, and (iv) Females Out-of-State. --- ```{r} # PUT CODE HERE ``` --- ### Part B Now run a regression of the following model: $$\text{Amount}_i=\hat{\beta_0}+\hat{\beta_1}Female_i+\hat{\beta_2}OutState_i+\hat{\beta_3}Female_i*OutState_i$$ --- ```{r} # PUT CODE HERE ``` ---- ### Part C Write out the estimated regression equation. --- ```{r} # PUT CODE HERE ``` --- ### Part D What does each coefficient mean? --- ```{r} # PUT CODE HERE ``` --- ### Part E Using the regression equation, what are the means for - (i) Males In-State - (ii) Males Out-of-State - (iii) Females In-State - (iv) Females Out-of-State? Compare to your answers in part A. --- ```{r} # PUT CODE HERE ``` --- ## Question 10 Collect your regressions from questions 5, 6b, 8a, 8b, and 9b and output them in a regression table with `huxtable()`. --- ```{r} # PUT CODE HERE ```