Examples

We will begin with bivariate data, examining relationships between \(X\) and \(Y\)
Our immediate aim is to explore associations between variables, quantified with correlation and linear regression
Later, we will develop more sophisticated tools to argue for causation

library("tidyverse") # for read_csv(), %>%, and the dplyr verbs used below
econfreedom <- read_csv("econfreedom.csv")
head(econfreedom)
## # A tibble: 6 x 6
##      X1 Country   ISO      ef    gdp continent
##   <dbl> <chr>     <chr> <dbl>  <dbl> <chr>
## 1     1 Albania   ALB    7.4   4543. Europe
## 2     2 Algeria   DZA    5.15  4784. Africa
## 3     3 Angola    AGO    5.08  4153. Africa
## 4     4 Argentina ARG    4.81 10502. Americas
## 5     5 Australia AUS    7.93 54688. Oceania
## 6     6 Austria   AUT    7.56 47604. Europe

econfreedom %>% glimpse()
## Rows: 112
## Columns: 6
## $ X1        <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ Country   <chr> "Albania", "Algeria", "Angola", "Argentina", "Australia", "…
## $ ISO       <chr> "ALB", "DZA", "AGO", "ARG", "AUS", "AUT", "BHR", "BGD", "BE…
## $ ef        <dbl> 7.40, 5.15, 5.08, 4.81, 7.93, 7.56, 7.60, 6.35, 7.51, 6.22,…
## $ gdp       <dbl> 4543.0880, 4784.1943, 4153.1463, 10501.6603, 54688.4459, 47…
## $ continent <chr> "Europe", "Africa", "Africa", "Americas", "Oceania", "Europ…

source("summaries.R") # use my summary_table function
econfreedom %>% summary_table(ef, gdp)
| Variable | Obs | Min | Q1 | Median | Q3 | Max | Mean | Std. Dev. |
|---|---|---|---|---|---|---|---|---|
| ef | 112 | 4.81 | 6.42 | 7.0 | 7.40 | 8.71 | 6.86 | 0.78 |
| gdp | 112 | 206.71 | 1307.46 | 5123.3 | 17302.66 | 89590.81 | 14488.49 | 19523.54 |
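summaries.R is my own helper script; a minimal sketch of what such a summary_table() function might look like (hypothetical code, not the actual script):

# hypothetical sketch of a summary_table() helper
library(dplyr)
library(tidyr)

summary_table <- function(df, ...) {
  df %>%
    select(...) %>% # keep only the requested (numeric) variables
    pivot_longer(everything(), names_to = "Variable") %>% # stack into long form
    group_by(Variable) %>%
    summarize(Obs = n(),
              Min = min(value),
              Q1 = quantile(value, 0.25),
              Median = median(value),
              Q3 = quantile(value, 0.75),
              Max = max(value),
              Mean = mean(value),
              `Std. Dev.` = sd(value))
}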
The best way to visualize an association between two variables is with a scatterplot
Each point is a pair of variable values \((x_i,y_i)\) for observation \(i\)
library("ggplot2")ggplot(data = econfreedom)+ aes(x = ef, y = gdp)+ geom_point(aes(color = continent), size = 2)+ labs(x = "Economic Freedom Index (2014)", y = "GDP per Capita (2014 USD)", color = "")+ scale_y_continuous(labels = scales::dollar)+ theme_pander(base_family = "Fira Sans Condensed", base_size=20)+ theme(legend.position = "bottom")

When describing a scatterplot's association, look for four things:
Direction: is the trend positive or negative?
Form: is the trend linear, quadratic, something else, or no pattern?
Strength: is the association strong or weak?
Outliers: do any observations break the trends above?

The sample covariance measures how \(X\) and \(Y\) vary together:

$$s_{X,Y}=\frac{1}{n-1}\sum^n_{i=1}(x_i-\bar{X})(y_i-\bar{Y})$$
† Henceforth we limit all measures to samples, for convenience. Population covariance is denoted \(\sigma_{X,Y}\)
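As a check on the formula, covariance can be computed "by hand" and compared with R's built-in cov() (a minimal sketch, assuming econfreedom is loaded as above):

# sample covariance "by hand" vs. the built-in cov()
econfreedom %>%
  summarize(cov_by_hand = sum((ef - mean(ef)) * (gdp - mean(gdp))) / (n() - 1),
            cov_builtin = cov(ef, gdp)) # both approx. 8923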
# base R
cov(econfreedom$ef, econfreedom$gdp)

## [1] 8922.933

# dplyr
econfreedom %>% summarize(cov = cov(ef, gdp))

## # A tibble: 1 x 1
##     cov
##   <dbl>
## 1 8923.

8923 what, exactly?
Covariance is denominated in the units of \(X\) times the units of \(Y\), which is hard to interpret. The correlation coefficient standardizes covariance into a unitless measure between \(-1\) and \(1\):

$$r_{X,Y}=\frac{s_{X,Y}}{s_X s_Y}=\frac{cov(X,Y)}{sd(X)sd(Y)}$$
$$\begin{align*} r&=\frac{1}{n-1}\sum^n_{i=1}\bigg(\frac{x_i-\bar{X}}{s_X}\bigg)\bigg(\frac{y_i-\bar{Y}}{s_Y}\bigg)\\ r&=\frac{1}{n-1}\sum^n_{i=1}Z_XZ_Y\\ \end{align*}$$
† Over n-1, a sample statistic!
‡ See today's class notes page for example code to calculate correlation "by hand" in R using the second method.
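Following that footnote, here is a minimal sketch of the "by hand" calculation using the z-score method (assuming econfreedom is loaded as above):

# correlation "by hand": multiply z-scores pairwise, sum, divide by n-1
econfreedom %>%
  mutate(z_ef  = (ef  - mean(ef))  / sd(ef),     # standardize ef
         z_gdp = (gdp - mean(gdp)) / sd(gdp)) %>% # standardize gdp
  summarize(r = sum(z_ef * z_gdp) / (n() - 1))    # should match cor(ef, gdp)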
Negative values \(\implies\) negative association
Positive values \(\implies\) positive association
Correlation of 0 \(\implies\) no association
The closer \(|r|\) is to 1, the stronger the association
\(|r|=1 \implies\) a perfectly linear association
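To see that \(r\) really is unitless, a quick check: rescaling a variable changes the covariance but not the correlation.

# rescaling gdp (dollars -> thousands of dollars) changes cov() but not cor()
econfreedom %>%
  summarize(cov_dollars   = cov(ef, gdp),
            cov_thousands = cov(ef, gdp / 1000),
            cor_dollars   = cor(ef, gdp),
            cor_thousands = cor(ef, gdp / 1000))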

# Base R: cov() or cor(df$x, df$y)
cov(econfreedom$ef, econfreedom$gdp)

## [1] 8922.933

cor(econfreedom$ef, econfreedom$gdp)

## [1] 0.5867018

# tidyverse method
econfreedom %>%
  summarize(covariance = cov(ef, gdp),
            correlation = cor(ef, gdp))

## # A tibble: 1 x 2
##   covariance correlation
##        <dbl>       <dbl>
## 1      8923.       0.587

corrplot is a great package (install and then load) to visualize correlations in data

library(corrplot) # see more at https://github.com/taiyun/corrplot
library(RColorBrewer) # for color scheme used here
library(gapminder) # for gapminder data

# need to make a correlation matrix with cor(); can only include numeric variables
gapminder_cor <- gapminder %>%
  dplyr::select(gdpPercap, pop, lifeExp)

# make a correlation table with cor (base R)
gapminder_cor_table <- cor(gapminder_cor)

# view it
gapminder_cor_table

##             gdpPercap         pop    lifeExp
## gdpPercap  1.00000000 -0.02559958 0.58370622
## pop       -0.02559958  1.00000000 0.06495537
## lifeExp    0.58370622  0.06495537 1.00000000

corrplot(gapminder_cor_table,
         type = "upper",
         method = "circle",
         order = "alphabet",
         col = viridis::viridis(100)) # custom color

Your Occasional Reminder: Correlation does not imply causation!
If \(X\) and \(Y\) are strongly correlated, \(X\) can still be endogenous!
See today's class notes page for more on Covariance and Correlation


$$Y = a + bX$$

Recall that a linear equation describing a line contains a vertical intercept \(a\) and a slope \(b\)
How do we choose the equation that best fits the data?
This process is called linear regression

Linear regression lets us estimate the slope of the population regression line between \(X\) and \(Y\) using sample data
We can make statistical inferences about the population slope coefficient
\(\text{slope}=\frac{\Delta Y}{\Delta X}\): for a 1-unit change in \(X\), by how many units does \(Y\) change?
Example: What is the relationship between class size and educational performance?

# install.packages("haven") # install for first uselibrary("haven") # load for importing .dta filesCASchool<-read_dta("../data/caschool.dta")glimpse(CASchool)
## Rows: 420
## Columns: 21
## $ observat <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ dist_cod <dbl> 75119, 61499, 61549, 61457, 61523, 62042, 68536, 63834, 6233…
## $ county   <chr> "Alameda", "Butte", "Butte", "Butte", "Butte", "Fresno", "Sa…
## $ district <chr> "Sunol Glen Unified", "Manzanita Elementary", "Thermalito Un…
## $ gr_span  <chr> "KK-08", "KK-08", "KK-08", "KK-08", "KK-08", "KK-08", "KK-08…
## $ enrl_tot <dbl> 195, 240, 1550, 243, 1335, 137, 195, 888, 379, 2247, 446, 98…
## $ teachers <dbl> 10.90, 11.15, 82.90, 14.00, 71.50, 6.40, 10.00, 42.50, 19.00…
## $ calw_pct <dbl> 0.5102, 15.4167, 55.0323, 36.4754, 33.1086, 12.3188, 12.9032…
## $ meal_pct <dbl> 2.0408, 47.9167, 76.3226, 77.0492, 78.4270, 86.9565, 94.6237…
## $ computer <dbl> 67, 101, 169, 85, 171, 25, 28, 66, 35, 0, 86, 56, 25, 0, 31,…
## $ testscr  <dbl> 690.80, 661.20, 643.60, 647.70, 640.85, 605.55, 606.75, 609.…
## $ comp_stu <dbl> 0.34358975, 0.42083332, 0.10903226, 0.34979424, 0.12808989, …
## $ expn_stu <dbl> 6384.911, 5099.381, 5501.955, 7101.831, 5235.988, 5580.147, …
## $ str      <dbl> 17.88991, 21.52466, 18.69723, 17.35714, 18.67133, 21.40625, …
## $ avginc   <dbl> 22.690001, 9.824000, 8.978000, 8.978000, 9.080333, 10.415000…
## $ el_pct   <dbl> 0.000000, 4.583333, 30.000002, 0.000000, 13.857677, 12.40875…
## $ read_scr <dbl> 691.6, 660.5, 636.3, 651.9, 641.8, 605.7, 604.5, 605.5, 608.…
## $ math_scr <dbl> 690.0, 661.9, 650.9, 643.5, 639.9, 605.4, 609.0, 612.5, 616.…
## $ aowijef  <dbl> 35.77982, 43.04933, 37.39445, 34.71429, 37.34266, 42.81250, …
## $ es_pct   <dbl> 1.000000, 3.583333, 29.000002, 1.000000, 12.857677, 11.40875…
## $ es_frac  <dbl> 0.01000000, 0.03583334, 0.29000002, 0.01000000, 0.12857677, …

| observat | dist_cod | county | district | gr_span | enrl_tot | teachers | calw_pct | meal_pct | computer | testscr | comp_stu | expn_stu | str | avginc | el_pct | read_scr | math_scr | aowijef | es_pct | es_frac |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 75119 | Alameda | Sunol Glen Unified | KK-08 | 195 | 10.90 | 0.5102 | 2.0408 | 67 | 690.80 | 0.3435898 | 6384.911 | 17.88991 | 22.690001 | 0.000000 | 691.6 | 690.0 | 35.77982 | 1.000000 | 0.0100000 |
| 2 | 61499 | Butte | Manzanita Elementary | KK-08 | 240 | 11.15 | 15.4167 | 47.9167 | 101 | 661.20 | 0.4208333 | 5099.381 | 21.52466 | 9.824000 | 4.583334 | 660.5 | 661.9 | 43.04933 | 3.583334 | 0.0358333 |
| 3 | 61549 | Butte | Thermalito Union Elementary | KK-08 | 1550 | 82.90 | 55.0323 | 76.3226 | 169 | 643.60 | 0.1090323 | 5501.955 | 18.69723 | 8.978000 | 30.000002 | 636.3 | 650.9 | 37.39445 | 29.000002 | 0.2900000 |
| 4 | 61457 | Butte | Golden Feather Union Elementary | KK-08 | 243 | 14.00 | 36.4754 | 77.0492 | 85 | 647.70 | 0.3497942 | 7101.831 | 17.35714 | 8.978000 | 0.000000 | 651.9 | 643.5 | 34.71429 | 1.000000 | 0.0100000 |
| 5 | 61523 | Butte | Palermo Union Elementary | KK-08 | 1335 | 71.50 | 33.1086 | 78.4270 | 171 | 640.85 | 0.1280899 | 5235.988 | 18.67133 | 9.080333 | 13.857677 | 641.8 | 639.9 | 37.34266 | 12.857677 | 0.1285768 |
| 6 | 62042 | Fresno | Burrel Union Elementary | KK-08 | 137 | 6.40 | 12.3188 | 86.9565 | 25 | 605.55 | 0.1824818 | 5580.147 | 21.40625 | 10.415000 | 12.408759 | 605.7 | 605.4 | 42.81250 | 11.408759 | 0.1140876 |
scatter <- ggplot(data = CASchool)+
  aes(x = str, y = testscr)+
  geom_point(color = "blue")+
  labs(x = "Student to Teacher Ratio",
       y = "Test Score")+
  theme_pander(base_family = "Fira Sans Condensed", base_size = 20)
scatter

$$\beta = \frac{\text{change in test score}}{\text{change in class size}} = \frac{\Delta \text{test score}}{\Delta \text{class size}}$$

$$\Delta \text{test score} = \beta \times \Delta \text{class size}$$

Example: suppose \(\beta=-0.6\) and class size falls by 2 students \((\Delta \text{class size}=-2)\):

$$\begin{align*} \Delta \text{test score} &= -2 \times \beta\\ \Delta \text{test score} &= -2 \times -0.6\\ \Delta \text{test score} &= 1.2 \\ \end{align*}$$

$$\text{test score} = \beta_0 + \beta_{1} \times \text{class size}$$
The line relating class size and test scores has the above equation
\(\beta_0\) is the vertical intercept: the test score where class size is 0
\(\beta_{1}\) is the slope of the regression line
This relationship only holds on average for all districts in the population; individual districts are also affected by other factors

$$\text{test score} = \beta_0 + \beta_1 \text{class size}+\text{other factors}$$
For now, we will ignore these until Unit III
Thus, \(\beta_0 + \beta_1 \text{class size}\) gives the average effect of class sizes on scores
Later, we will want to estimate the marginal effect (causal effect) of each factor on an individual district's test score, holding all other factors constant

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + u$$

where \(Y\) is the dependent variable, \(X_1\) and \(X_2\) are independent variables, and \(u\) is the error term capturing all other factors
How do we draw a line through the scatterplot? We do not know the "true" \(\beta_0\) or \(\beta_1\)
We do have data from a sample of class sizes and test scores†
So the real question is, how can we estimate \(\beta_0\) and \(\beta_1\)?
† Data are student-teacher ratios and average test scores on the Stanford 9 Achievement Test for 5th-grade students, for 420 K-6 and K-8 school districts in California in 1999 (Stock and Watson, 2015, p. 141)



The error for observation \(i\) is the vertical distance between the observed value \(Y_i\) and the line's predicted value \(\hat{Y_i}\):

$$u_i = Y_i - \hat{Y_i}$$

Squaring the errors (so positive and negative errors do not cancel) and summing over all \(n\) observations gives the sum of squared errors:

$$SSE = \sum^n_{i=1} u_i^2$$
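To make this concrete, a quick sketch (a hypothetical helper, not from the original notes) computes the SSE for any candidate line through the CASchool data loaded above:

# SSE for a candidate line: testscr = b0 + b1 * str
sse <- function(b0, b1, data = CASchool) {
  u <- data$testscr - (b0 + b1 * data$str) # errors from the candidate line
  sum(u^2)                                 # sum of squared errors
}

sse(700, -2.5)     # one candidate line
sse(698.93, -2.28) # the OLS estimates (found below) yield a smaller SSE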

$$\min_{\beta_0, \beta_1} \sum^n_{i=1}[\underbrace{Y_i-(\underbrace{\beta_0+\beta_1 X_i}_{\hat{Y_i}})}_{u}]^2$$
$$\hat{Y_i}=\hat{\beta_0}+\hat{\beta_1}X_i$$
\(\hat{\beta_0}\) and \(\hat{\beta_1}\) ("beta 0 hat" & "beta 1 hat") are the OLS estimators of population parameters \(\beta_0\) and \(\beta_1\) using sample data
The predicted value of Y given X, based on the regression, is \(E(Y_i|X_i)=\hat{Y_i}\)
The residual or prediction error for the \(i^{th}\) observation is the difference between observed \(Y_i\) and its predicted value, \(\hat{u_i}=Y_i-\hat{Y_i}\)
$$\hat{\beta}_0=\bar{Y}-\hat{\beta_1}\bar{X}$$
$$\hat{\beta}_1=\frac{\displaystyle\sum^n_{i=1}(X_i-\bar{X})(Y_i-\bar{Y})}{\displaystyle\sum^n_{i=1}(X_i-\bar{X})^2}=\frac{s_{XY}}{s^2_X}= \frac{cov(X,Y)}{var(X)}$$
† See next class's notes page for proofs.
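These formulas are easy to apply directly in R (a quick check using the CASchool data loaded above; it should match the regression output below):

# OLS "by hand": slope = cov(X,Y)/var(X); intercept = Ybar - slope * Xbar
CASchool %>%
  summarize(beta_1_hat = cov(str, testscr) / var(str),
            beta_0_hat = mean(testscr) - (cov(str, testscr) / var(str)) * mean(str))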
scatter
There is some true (unknown) population relationship: $$\text{test score}=\beta_0+\beta_1 \times str$$
\(\beta_1=\frac{\Delta \text{test score}}{\Delta \text{str}}= ??\)

scatter +
  geom_smooth(method = "lm", color = "red")

# run regression of testscr on str
school_reg <- lm(testscr ~ str, data = CASchool)
The format for a regression is lm(y ~ x, data = df)
y is the dependent variable (listed first!)
~ means "modeled by" or "explained by"
x is the independent variable
df is the name of the dataframe where the data is stored
# look at reg object
school_reg
## 
## Call:
## lm(formula = testscr ~ str, data = CASchool)
## 
## Coefficients:
## (Intercept)          str  
##      698.93        -2.28

The lm object school_reg is a list object. Looking at the summary, there's a lot of information here!
These objects are cumbersome; they come from a much older, pre-tidyverse era of base R
Luckily, we now have tidy ways of working with regressions!
summary(school_reg) # get full summary
## 
## Call:
## lm(formula = testscr ~ str, data = CASchool)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -47.727 -14.251   0.483  12.822  48.540 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 698.9330     9.4675  73.825  < 2e-16 ***
## str          -2.2798     0.4798  -4.751 2.78e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.58 on 418 degrees of freedom
## Multiple R-squared:  0.05124, Adjusted R-squared:  0.04897 
## F-statistic: 22.58 on 1 and 418 DF,  p-value: 2.783e-06
The broom package allows us to tidy up regression objects†
The tidy() function creates a tidy tibble of regression output
# load packages
library(broom)

# tidy regression output
tidy(school_reg)
## # A tibble: 2 x 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)   699.       9.47      73.8  6.57e-242
## 2 str            -2.28     0.480     -4.75 2.78e-  6
# tidy regression output (with confidence intervals!)
tidy(school_reg, conf.int = TRUE)
## # A tibble: 2 x 7
##   term        estimate std.error statistic   p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
## 1 (Intercept)   699.       9.47      73.8  6.57e-242   680.      718. 
## 2 str            -2.28     0.480     -4.75 2.78e-  6    -3.22     -1.34

glance() shows us a lot of overall regression statistics and diagnostics

# look at regression statistics and diagnostics
glance(school_reg)
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1    0.0512        0.0490  18.6      22.6 2.78e-6     1 -1822. 3650. 3663.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

augment() creates useful new variables in the stored lm object:
.fitted are fitted (predicted) values from the model, i.e. \(\hat{Y}_i\)
.resid are residuals (errors) from the model, i.e. \(\hat{u}_i\)

# add regression-based values to data
augment(school_reg)
## # A tibble: 420 x 8
##    testscr   str .fitted .resid .std.resid    .hat .sigma  .cooksd
##      <dbl> <dbl>   <dbl>  <dbl>      <dbl>   <dbl>  <dbl>    <dbl>
##  1    691.  17.9    658.   32.7      1.76  0.00442   18.5 0.00689 
##  2    661.  21.5    650.   11.3      0.612 0.00475   18.6 0.000893
##  3    644.  18.7    656.  -12.7     -0.685 0.00297   18.6 0.000700
##  4    648.  17.4    659.  -11.7     -0.629 0.00586   18.6 0.00117 
##  5    641.  18.7    656.  -15.5     -0.836 0.00301   18.6 0.00105 
##  6    606.  21.4    650.  -44.6     -2.40  0.00446   18.5 0.0130  
##  7    607.  19.5    654.  -47.7     -2.57  0.00239   18.5 0.00794 
##  8    609   20.9    651.  -42.3     -2.28  0.00343   18.5 0.00895 
##  9    612.  19.9    653.  -41.0     -2.21  0.00244   18.5 0.00597 
## 10    613.  20.8    652.  -38.9     -2.09  0.00329   18.5 0.00723 
## # … with 410 more rows

The equatiomatic package prints this equation in markdown or \(\LaTeX\): $$ \operatorname{testscr} = 698.93 - 2.28(\operatorname{str}) + \epsilon $$
Here was my code:
# install.packages("equatiomatic") # install for first uselibrary(equatiomatic) # load itextract_eq(school_reg, # regression lm object use_coefs = TRUE, # use names of variables coef_digits = 2, # round to 2 digits fix_signs = TRUE) # fix negatives (instead of + -)
## $$
## \operatorname{testscr} = 698.93 - 2.28(\operatorname{str}) + \epsilon
## $$

In an R chunk in R Markdown, set {r, results="asis"} to print this raw output to be rendered

CASchool %>%
  filter(district == "Richmond Elementary") %>%
  dplyr::select(district, testscr, str)
## # A tibble: 1 x 3
##   district            testscr   str
##   <chr>                 <dbl> <dbl>
## 1 Richmond Elementary    672.    22

Predicted value: $$\widehat{\text{Test Score}}_{\text{Richmond}}=698-2.28(22) \approx 648$$
Residual $$\hat{u}_{Richmond}=672-648 \approx 24$$
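A quick way to check these numbers against the fitted school_reg object (a minimal sketch using base R's predict()):

# predicted test score at str = 22 (Richmond), from the fitted coefficients
y_hat <- predict(school_reg, newdata = data.frame(str = 22))
y_hat       # about 648.8 (the rounded calculation above gives approx. 648)
672 - y_hat # residual of about 23 (approx. 24 with the rounded values)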
